Methods employing phase state analysis for use in speech synthesis and recognition

ABSTRACT

A computer-implemented method for automatically analyzing, predicting, and/or modifying acoustic units of prosodic human speech utterances for use in speech synthesis or speech recognition. Possible steps include: initiating analysis of acoustic wave data representing the human speech utterances, via the phase state of the acoustic wave data; using one or more phase state defined acoustic wave metrics as common elements for analyzing, and optionally modifying, pitch, amplitude, duration, and other measurable acoustic parameters of the acoustic wave data, at predetermined time intervals; analyzing acoustic wave data representing a selected acoustic unit to determine the phase state of the acoustic unit; and analyzing the acoustic wave data representing the selected acoustic unit to determine at least one acoustic parameter of the acoustic unit with reference to the determined phase state of the selected acoustic unit. Also included are systems for implementing the described and related methods.

CROSS REFERENCE TO A RELATED APPLICATION

This application claims the benefit of provisional patent application No. 61/138,834, filed on Dec. 18, 2008, the disclosure of which provisional patent application is incorporated by reference herein.

COPYRIGHT NOTIFICATION

Portions of this patent application contain materials that are subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

(Not applicable.)

The present invention provides system-effected methods for analyzing, predicting, and/or modifying acoustic units of human utterances for use in speech synthesis and recognition. The invention also provides computerized systems and software for implementing the inventive methods. In addition, the present invention provides, inter alia, methods of computer implemented, automated, analysis of expressive human utterances that are linguistically as well as cognitively meaningful. The results of the analysis can be employed in computerized text-to-speech synthesis or in computerized speech recognition or in both synthesis and recognition. The inventive method disclosed herein relates generally to a method of quantifying and automating the analysis of expressive human utterances that are linguistically as well as cognitively meaningful, for use in text-to-speech synthesis or speech recognition or both.

BACKGROUND OF THE INVENTION

The human voice communicates meaning and identity simultaneously. Typically, an expressive human voice emphasizes syllables, phrases, even paragraphs, to clarify what is being said, and has unique voice characteristics that tell one who is speaking. One objective of speech synthesis can be to create synthesized speech that communicates the voice identity of the speaker and that speaks with rhythms, intonations, and articulations that are close to those of a human being.

Two known approaches for synthesizing speech, formant-based synthesis and concatenation of acoustic units from voice recordings, have shortcomings in this respect. While the concatenated approach using prerecorded speech units can provide a generally identifiable voice, it is usually unable to simultaneously provide expressive voice emphases and intonations that enhance the listener's understanding of the text being synthesized as speech.

U.S. Patent Application Publication No. 2008/0195391 to Marple et al. describes a hybrid speech synthesizer, method and use, which includes embodiments comprising a hybrid of the known formant and concatenation methods for synthesizing speech. As described, speech synthesizer embodiments can predict, locate, and concatenate wave forms in sequence to provide acoustic units for expressive utterances when a specified acoustic unit (or a close facsimile thereof) is found to exist in a database of acoustic units. When the predicted acoustic unit is not found, the synthesizer can manipulate acoustic wave data for an acoustic unit candidate that is close to the predicted values of the ideal candidate so as to create an ideal candidate, or a perceptually acceptable substitute.

U.S. patent application Ser. No. 12/188,763 to Nitisaroj et al. describes a method of automated text parsing and annotation for expressive prosodies that indicates how the text is to be pronounced, which is useful in speech synthesis and voice recognition. Also described are the abilities of professional voice talents trained to produce expressive speech according to annotations for a particular prosody in terms of articulations, with desired pitches, amplitudes, and rates of speech.

The foregoing description of background art may include insights, discoveries, understandings or disclosures, or associations together of disclosures, that were not known to the relevant art prior to the present invention but which were provided by the invention. Some such contributions of the invention may have been specifically pointed out herein, whereas other such contributions of the invention will be apparent from their context. Merely because a document may have been cited here, no admission is made that the field of the document, which may be quite different from that of the invention, is analogous to the field or fields of the present invention.

SUMMARY OF THE INVENTION

The present invention provides, in one aspect, a computer-implemented method for analyzing, predicting, and/or modifying acoustic units of prosodic human speech utterances for use in speech synthesis or speech recognition. Broadly stated, this aspect of the inventive method can comprise one or more steps selected from the following:

-   initiating analysis of acoustic wave data representing the human speech utterances, via the phase state of the acoustic wave data, the acoustic wave data being in constrained or unconstrained form;
-   using one or more phase state defined acoustic wave metrics as common elements for analyzing, and optionally modifying, one or more measurable acoustic parameters selected from the group consisting of pitch, amplitude, duration, and other measurable acoustic parameters of the acoustic wave data, at predetermined time intervals, two or more of the acoustic parameters optionally being analyzed and/or modified simultaneously;
-   analyzing acoustic wave data representing a selected one of the acoustic units to determine the phase state of the acoustic unit; and
-   analyzing the acoustic wave data representing the selected acoustic unit to determine at least one acoustic parameter of the acoustic unit with reference to the determined phase state of the selected acoustic unit.
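
As a minimal sketch of what determining a phase state can look like in practice, the following Python example builds an analytic signal from a short frame of real samples and reads off its envelope and unwrapped instantaneous phase. The construction used here, a discrete Hilbert transform computed with a direct DFT, is a standard signal-processing technique assumed for illustration only; the function names `analytic_signal` and `unwrapped_phase` are hypothetical and are not taken from this disclosure.

```python
import cmath
import math


def analytic_signal(samples):
    """Return the complex analytic signal of a real sequence.

    Uses a direct O(n^2) DFT, so it is only suitable for short
    illustrative frames, not for production audio.
    """
    n = len(samples)
    # Forward DFT of the real frame.
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist when n is even), double the positive
    # frequencies, and zero the negative frequencies.
    for k in range(1, n):
        if n % 2 == 0 and k == n // 2:
            continue
        if k < (n + 1) // 2:
            spectrum[k] *= 2
        else:
            spectrum[k] = 0
    # Inverse DFT gives the analytic signal.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def unwrapped_phase(signal):
    """Return the instantaneous phase with 2*pi jumps removed."""
    phases = [cmath.phase(z) for z in signal]
    unwrapped = [phases[0]]
    for p in phases[1:]:
        delta = p - unwrapped[-1]
        # Move the step into the range [-pi, pi] before accumulating.
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        unwrapped.append(unwrapped[-1] + delta)
    return unwrapped
```

The magnitude of each analytic sample gives an amplitude envelope and the slope of the unwrapped phase gives an instantaneous frequency, so a single representation can serve as a common reference for amplitude, pitch, and timing measurements at chosen time intervals.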

A further aspect of the invention provides a computer-implemented method comprising matching a sequence of acoustic units comprising a speech signal containing continuous prosodic human speech utterances with a sequence of text capable of visually representing the speech in the speech signal. The matching can comprise one or more of the method steps described herein.

In another aspect, the invention provides a method for categorically mapping the relationship of at least one text unit in a sequence of text to at least one corresponding prosodic phonetic unit, to at least one linguistic feature category in the sequence of text, and to at least one speech utterance represented in a synthesized speech signal. The method can comprise one or more steps selected from the following:

-   identifying, and optionally modifying, acoustic data representing the at least one speech utterance, to provide the synthesized speech signal;
-   identifying, and optionally modifying, the acoustic data representing the at least one utterance to provide the at least one speech utterance with an expressive prosody determined according to prosodic rules; and
-   identifying acoustic unit feature vectors for each of the at least one prosodic phonetic units, each acoustic unit feature vector comprising a bundle of feature values selected according to proximity to a statistical mean of the values of acoustic unit candidates available for matching with the respective prosodic phonetic unit and, optionally, for acoustic continuity with at least one adjacent acoustic feature vector.

A further aspect of the invention provides a method of mapping a text unit to a prosodic phonetic unit, which method comprises determining individual linguistic and acoustic weights for each prosodic phonetic unit according to linguistic feature hierarchies. The linguistic feature hierarchies can be related to a prior adjacent prosodic phonetic unit in a sequence of prosodic phonetic units and to a next adjacent prosodic phonetic unit in the sequence, and each candidate acoustic unit can have different target and join weights for each respective end of the candidate acoustic unit, although, in a suitable context, the target and join weights for the ends of a prosodic phonetic unit can be similar, if desired.

The methods of the invention can include measuring one or more acoustic parameters, optionally F0, F1, F2, F3, energy, and the like, across a particular acoustic unit corresponding to a particular prosodic phonetic unit to determine time related changes in the one or more acoustic parameters, and can include modeling the particular acoustic unit and the relevant acoustic parameter values of the prior adjacent prosodic phonetic unit and the next adjacent prosodic phonetic unit.

Modeling can comprise: applying combinations of fourth-order polynomials and second- and third-order polynomials to represent n-dimensional trajectories of the modeled acoustic units through unconstrained acoustic space; or applying a lower-order polynomial to trajectories in constrained acoustic space; and optionally can comprise modeling diphones and triphones.
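
The polynomial modeling described above can be illustrated with a short, hedged Python sketch that fits a low-order polynomial to one acoustic parameter sampled across a unit. The least-squares fit via normal equations, the normalized time axis, and the names `fit_polynomial` and `evaluate` are assumptions chosen for illustration; the disclosure does not specify a fitting procedure. An n-dimensional trajectory could be handled by fitting each dimension (for example F0, F1, F2, energy) separately.

```python
def fit_polynomial(times, values, order):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    size = order + 1
    # Normal equations A c = b, where A[i][j] = sum(t^(i+j)) and
    # b[i] = sum(v * t^i) over all samples.
    a = [[sum(t ** (i + j) for t in times) for j in range(size)] for i in range(size)]
    b = [sum(v * t ** i for t, v in zip(times, values)) for i in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for i in range(size - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, size))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def evaluate(coeffs, t):
    """Evaluate the fitted polynomial at normalized time t."""
    return sum(c * t ** i for i, c in enumerate(coeffs))
```

Keeping the time axis normalized to the interval from 0 to 1 keeps the normal equations reasonably well conditioned for the second- to fourth-order fits mentioned in the text; higher orders would call for a more robust solver.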

If desired, pursuant to further aspects of the invention, linguistically selected acoustical candidates can be employed for calculating acoustic features for synthesizing speech utterances from text. For example, linguistically selected acoustical candidates can be employed by calculating an absolute and/or relative desired acoustic parameter value, optionally in terms of fundamental frequency and/or a change in fundamental frequency over the duration of the acoustic unit. The duration can be represented by a single point, multiple points, Hermite splines or any other suitable representation. The desired acoustic parameter value can be based on a weighted average of the actual acoustic parameters for a set of candidate acoustic units selected according to their correspondence to a particular linguistic context. If desired, the weighting can favor acoustic unit candidates more closely corresponding to the particular linguistic context.
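
A weighted average of this kind can be sketched as follows. The feature names, the rule that a candidate's weight equals its number of matching linguistic features, and the functions `context_weight` and `desired_value` are all hypothetical choices made for illustration; the disclosure leaves the weighting scheme open.

```python
def context_weight(candidate_features, target_features):
    """Count how many linguistic features of a candidate match the target context."""
    return sum(
        1 for name, value in target_features.items()
        if candidate_features.get(name) == value
    )


def desired_value(candidates, target_features):
    """Weighted average of a measured parameter over candidate acoustic units.

    `candidates` is a list of (measured_value, features) pairs. Candidates
    that match the target context more closely receive larger weights.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    weights = [context_weight(features, target_features) for _, features in candidates]
    total = sum(weights)
    if total == 0:
        # No candidate matches any feature: fall back to a plain mean.
        return sum(value for value, _ in candidates) / len(candidates)
    return sum(value * w for (value, _), w in zip(candidates, weights)) / total
```

With three candidates at 110, 130, and 150 Hz that match two, one, and zero target features respectively, the result is (2 x 110 + 1 x 130) / 3, so the closest contextual match dominates the desired value.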

A further aspect of the invention provides a method for assigning linguistic and acoustic weights to prosodic phonetic units useful for concatenation into synthetic speech or for speech recognition. This method can comprise determining individual linguistic and acoustic weights for each prosodic phonetic unit according to linguistic feature hierarchies related to a prior adjacent prosodic phonetic unit and to a next adjacent prosodic phonetic unit. Each candidate acoustic unit can have different target and join weights for each respective end of the candidate acoustic unit, if desired, or in some circumstances the target and join weights can be the same.

The weight-assigning method can also include measuring one or more acoustic parameters, optionally F0, F1, F2, F3, energy, and the like, across a particular acoustic unit corresponding to a particular prosodic phonetic unit to determine time related changes in the one or more acoustic parameters and/or modeling the particular acoustic unit and the relevant acoustic parameter values of the prior adjacent prosodic phonetic unit and the next adjacent prosodic phonetic unit.

A still further aspect of the invention provides a method for deriving a path through acoustic space. The acoustic path can comprise desired acoustic feature values for each sequential unit of a sequence of acoustic units to be employed in synthesizing speech from text. The method can comprise calculating the acoustic path in absolute and/or relative coordinates, for example, in terms of fundamental frequency and/or a change in fundamental frequency over the duration of the synthesizing of the text, for the sequence of acoustic units. Each desired sequential acoustic unit can be represented by a representation, such as, for example, a single point, multiple points, Hermite splines or another suitable acoustic unit representation, according to a weighted average of the acoustic parameters of the acoustic unit representation. The weighted average of the acoustic parameters can be based on a degree of accuracy with which the acoustic parameters for each such sequentially desired acoustic unit are known, and/or on a degree of influence ascribed to each sequential acoustic unit according to the context of the acoustic unit in the sequence of desired acoustic units.

Yet another aspect of the invention provides a method of deriving an acoustic path comprising a sequence of desired acoustic units extending through unconstrained acoustic space, the acoustic path being useful for synthesizing speech from text with a desired style of speech prosody by concatenating the sequence of desired acoustic units. The method can comprise one or more of the following steps or elements:

-   providing a database of acoustic units wherein each acoustic unit is identified according to a prosodic phonetic unit name and at least one additional linguistic feature; and wherein each acoustic unit has been analyzed according to phase-state metrics so that pitch, energy, and spectral wave data can be modified simultaneously at one or more instants in time;
-   mapping each acoustic unit to prosodic phonetic unit categorizations and additional linguistic categorizations enabling the acoustic unit to be specified and/or altered to provide one or more acoustic units for incorporation into expressively synthesized speech according to prosodic rules;
-   calculating weighted absolute and/or relative acoustic values for a set of candidate acoustic units to match each desired acoustic unit, one candidate set per desired acoustic unit, matching being in terms of linguistic features for the corresponding mapped prosodic phonetic unit or a substitute for the corresponding mapped prosodic phonetic unit;
-   calculating an acoustic path through n-dimensional acoustic space to be sequenced as an utterance of synthesized speech, the acoustic path being defined by the weighted average values for each candidate set of acoustic units; and
-   selecting and modifying, as needed, a sequence of acoustic units, or sub-units, for the synthesized speech according to the differences between the weighted acoustic values for a candidate acoustic unit, or sub-unit, and the weighted acoustic values of a point on the calculated acoustic path.

Some aspects of the present invention enable the description of acoustic dynamics that can be employed in expressive utterances of human speakers speaking texts according to prosodic rules derivable from the text to be pronounced, from the relationship(s) of speaker(s) to listener(s) and from their motives for speaking and listening.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Some embodiments of the invention, and of making and using the invention, as well as the best mode contemplated of carrying out the invention, are described in detail herein and, by way of example, with reference to the accompanying drawings, in which like reference characters designate like elements throughout the several views, and in which:

FIG. 1 is a graphical depiction of an impulse response obtainable by filtering an acoustical speech-related signal with an impulse response convolution filter, useful in an analytical method embodiment of the invention;

FIG. 2A is a graphical depiction of an envelope for a low pass filtered residual signal obtained by inverse filtering of a signal from a single word, the word “way”, the inverse filtering being useful in another analytical method embodiment of the invention;

FIG. 2B is a graphical depiction of the corresponding instantaneous phase of the signal employed in FIG. 2A;

FIG. 3 is a graphical depiction of the unwrapped or unrolled phase of the signal employed in FIG. 2A;

FIG. 4 is a graphical depiction of an analytic signal in the complex domain representing a portion of the word “way” depicted in FIG. 2A;

FIG. 5 is a graphical depiction of several processing stages, reading from the top to the bottom of the figure, of the marking of the signal employed in FIG. 2A with pitch markers (top row) leading to the determination of the phase of the analytical signal (bottom row);

FIG. 6 is a graphical depiction of the distribution with respect to frequency plotted on the abscissa and duration plotted on the ordinate of an available candidate population of prosodic phonetic units to represent the leading “a” of the word “away” in the phrase “We were away a week”;

FIG. 7 is a view similar to FIG. 6 wherein the population of prosodic phonetic units is segmented by pitch level;

FIG. 8 is a view similar to FIG. 6 wherein the population of prosodic phonetic units is segmented by tonality;

FIG. 9 is a view similar to FIG. 8 wherein the population of prosodic phonetic units is reclassified using a machine technique;

FIG. 10 is a view similar to FIG. 7 wherein the population of prosodic phonetic units is reclassified using a machine technique;

FIG. 11 is a view similar to FIG. 6 wherein a combination of linguistic features is used to classify the population of prosodic phonetic units, according to a method embodiment of the invention;

FIG. 12 is a graphical depiction of an acoustic unit pathway through acoustic space for the phrase “We were away a week” exemplifying a further embodiment of the methods of the invention;

FIG. 13 is a three-dimensional graphical depiction of the acoustic unit pathway shown in FIG. 12;

FIG. 14 is a block flow diagram illustrating one embodiment of a speech signal analysis method according to the invention; and

FIG. 15 is a block flow diagram illustrating one embodiment of a method of reassembling a number of speech signal components produced according to method embodiments of the invention, as synthesized speech.

DETAILED DESCRIPTION OF THE INVENTION

Methods according to the invention can include concatenating a selected acoustic unit with an additional acoustic unit having at least one acoustic parameter compatible with the respective selected acoustic unit parameter as determined in a phase state of the additional acoustic unit similar to, or identical with, the phase state of the selected acoustic unit. If desired, the additional acoustic unit can have a pitch, amplitude and duration compatible with the pitch, amplitude and duration of the selected acoustic unit.

If desired, method embodiments of the invention can include matching a sequence of the acoustic units with a sequence of text capable of visually representing the speech utterances. Optionally, the speech utterances can have an identifiable prosody and the method can comprise labeling the text with prosodic phonetic units to represent the sequence of text with the prosody identified in the speech signal.

Furthermore, method embodiments of the invention can include tagging each prosodic phonetic unit with a bundle of acoustic feature values to describe the prosodic phonetic unit acoustically. The acoustic feature values can comprise values for context-independent prosodic phonetic unit features and, optionally, context-dependent prosodic phonetic unit features determined by applying linguistic rules to the sequence of text.

Other useful steps that can be employed in the practice of the invention, either alone or with one or more others of the steps listed or of the method steps and elements described elsewhere herein, include the following:

-   assembling the sequence of acoustic units from available acoustic units in a database of acoustic units wherein, optionally, each available acoustic unit in the database can comprise a recorded element of speech voiced by a human speaker;
-   determining a desired acoustic unit pathway comprising a sequence of acoustic feature vectors corresponding with a sequence of the acoustic units or with a sequence of text representing the speech utterances;
-   employing for each acoustic feature vector a bundle of feature values selected for closeness to a statistical mean of the values of candidate acoustic units to be matched with a prosodic phonetic unit or one of the prosodic phonetic units and, optionally, for acoustic continuity with at least one adjacent acoustic feature vector;
-   selecting an acoustic unit for the sequence of acoustic units from the database of candidate acoustic units according to the proximity of the feature values of the selected acoustic unit to a corresponding acoustic feature vector on the acoustic unit pathway and, optionally, for acoustic continuity of the selected acoustic unit with the acoustic feature values of one or more neighboring acoustic units in the sequence of acoustic units;
-   selecting each acoustic unit for the sequence of acoustic units according to a rank ordering of acoustic units available to represent a specific prosodic phonetic unit in the sequence of text, the rank ordering being determined by the differences between the feature values of available acoustic units and the feature values of an acoustic feature vector on the acoustic unit pathway; and
-   determining individual linguistic and acoustic weights for each prosodic phonetic unit according to linguistic feature hierarchies related to a prior adjacent prosodic phonetic unit and to a next adjacent prosodic phonetic unit, wherein each candidate acoustic unit can have different target and join weights for each respective end of the candidate acoustic unit.
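
The rank ordering of available acoustic units against a point on the acoustic unit pathway can be sketched in Python as below. The weighted Euclidean distance, the two-element feature vectors (F0 in Hz and duration in seconds), the per-feature weights, and the names `weighted_distance` and `rank_candidates` are assumptions made for illustration, not the ranking measure of this disclosure.

```python
import math


def weighted_distance(target, candidate, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(
        sum(w * (t - c) ** 2 for t, c, w in zip(target, candidate, weights))
    )


def rank_candidates(target, candidates, weights):
    """Return candidate unit ids ordered from closest to farthest.

    `candidates` maps a unit id to its feature vector. The weights put
    features measured in different units on a comparable scale.
    """
    scored = [
        (weighted_distance(target, vector, weights), unit_id)
        for unit_id, vector in candidates.items()
    ]
    scored.sort()
    return [unit_id for _, unit_id in scored]
```

A selection step could take the first entry of the ranking, or could walk down the ranking until it finds a unit that also joins acceptably with its already selected neighbors.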

The methods of the invention can, if desired, employ one or more further steps or elements selected from the following:

-   modifying a candidate acoustic unit by analyzing the candidate acoustic unit wave data into a vocal tract resonance signal and a residual glottal signal, modifying an acoustic parameter of the glottal signal and recombining the vocal tract resonance signal with the modified glottal signal to provide a modified candidate acoustic unit;
-   analyzing the glottal signal to determine the time-dependent amplitude of the glottal signal with reference to the phase state of the glottal signal, determining the fundamental frequency of the glottal signal in the phase state and modifying the fundamental frequency of the glottal signal in the phase state to have a desired value; and
-   when the vocal tract resonance signal comprises partial correlation coefficients, modifying the vocal tract resonance signal by converting the partial correlation coefficients to log area ratios and altering or interpolating the log area ratios.
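
The last step above can be illustrated with a short Python sketch that converts partial correlation (PARCOR) coefficients to log area ratios, interpolates in that domain, and converts back. The sign convention used here, g = ln((1 + k) / (1 - k)), is one of two in common use and is an assumption, as are the function names; the disclosure does not state which convention it employs.

```python
import math


def parcor_to_lar(coeffs):
    """Convert PARCOR coefficients (each strictly between -1 and 1) to log area ratios."""
    for k in coeffs:
        if not -1.0 < k < 1.0:
            raise ValueError("PARCOR coefficients must lie strictly between -1 and 1")
    return [math.log((1.0 + k) / (1.0 - k)) for k in coeffs]


def lar_to_parcor(lars):
    """Invert parcor_to_lar; tanh keeps every result inside (-1, 1)."""
    return [math.tanh(g / 2.0) for g in lars]


def interpolate_parcor(coeffs_a, coeffs_b, alpha):
    """Blend two PARCOR sets in the log-area-ratio domain.

    alpha = 0 returns coeffs_a, alpha = 1 returns coeffs_b.
    """
    lar_a = parcor_to_lar(coeffs_a)
    lar_b = parcor_to_lar(coeffs_b)
    mixed = [(1.0 - alpha) * a + alpha * b for a, b in zip(lar_a, lar_b)]
    return lar_to_parcor(mixed)
```

One reason to interpolate in this domain is that the inverse mapping is a hyperbolic tangent, so any altered or interpolated log area ratio maps back to a coefficient with magnitude below one, which corresponds to a stable all-pole vocal tract filter.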

In the methods provided by the invention and its various aspects, the acoustic metrics or acoustic parameters can comprise one or more metrics or parameters selected from the group consisting of pitch, amplitude, duration, fundamental frequency, formants, mel-frequency cepstral coefficients, energy, and time.

Various optional features employable in the methods of the invention include that:

-   the speech signal can be a synthesized speech signal comprising a sequence of acoustic units concatenated to audibilize the sequence of text;
-   the acoustic units can be derived from prosodic speech recordings generated by human speakers pronouncing text annotated according to specific rules for a defined prosody;
-   the sequence of text can be generated to visually represent the speech recognized in the speech signal; and
-   the sequence of text can be selected from the group consisting of a phrase, a sentence, multiple sentences, a paragraph, multiple paragraphs, a discourse, and a written work.

The invention also includes a computerized system comprising software for performing any of the invention methods described herein. The software can be stored in, or resident in, computer readable media and optionally can be running on the computerized system.

In one aspect the inventive method can utilize acoustic wave metrics for measuring physical properties of acoustic waves uttered by human speakers, for example, fundamental frequency (F0), mel-frequency cepstral coefficients (“MFCCs”), energy, and time.

Initially, speakers can be trained professionals able to speak expressively for specific prosodies as defined by the text to be pronounced. The text to be pronounced can be parsed using prosodic text parsing rules (disambiguation, part of speech tagging, syntactical structure, prosodic phrasing, lexical stress, simple semantics, discourse rules, etc.) and then annotated to be spoken according to one or more specific prosodies derived automatically from the text to be synthesized, plus the relationship of the speaker(s) to the listener(s) and their predicted motives for speaking and listening.

Analytic methods according to the invention can use traditional acoustic wave metrics in ways that can be simultaneously precise and applicable for the purposes of: (1) automatically synthesizing prosodic speech directly from plain text input, or (2) correctly recognizing text directly from expressively articulated speech utterances, or for both synthesis and recognition.

In another aspect of the invention, predictions of values for physical wave metrics can be used to synthesize expressive prosodic speech directly from text. To recognize expressive human speech utterances and transcribe them correctly as text, various measured values of physical waves can be analyzed for patterns corresponding to alternative utterances of words, phrases and sentences according to identifiable prosodies and their linguistic dependencies.

Methods according to the invention can be trained, using machine learning principles, to be applied to the voices of everyday persons, including their dialects and mispronunciations, along with any unusual vocal properties they may have due to various physical anomalies, for example gravelly, raspy, hoarse, or high-pitched voices and the like. Also, some methods according to the invention can be used to generate special vocal effects and prosodies to be used by computer game avatars, science fiction entities, and stereotypical characters.

Individuals commonly vary their pronunciations for the same word within their own speech. Acoustic modeling based on recorded data can be unrewarding because of large variability in the acoustical properties of the speech utterances, leading speech synthesis and speech recognition practitioners to “normalize” their acoustic data. Thus, speech synthesis practitioners may limit the speech synthesis range to substantially less than the normally expressive ranges of most human speakers, who may exhibit wide fluctuations of pitch, amplitude and duration for the same words in their utterances. And speech recognition practitioners may pre-process speech, equalizing durations, reducing pitch change ranges, and limiting loudness excursions for lexical and phrasal stresses for phonemes and words. Such compromises can lead to poor results.

Aspects of the invention described herein can utilize acoustic utterance units, termed “prosodic phonetic units” herein, which can communicate prosodic elements in a speech utterance and can be distinguished from phonemes, which do not usually communicate prosody or prosodic elements. As is described herein, prosodic phonetic units can be employed for the production of computer-articulated utterances wherein pitch, amplitude, duration and other acoustic parameters can be varied so as to yield speech utterances according to a particular prosody.

Considered another way, the term “prosodic phonetic unit” can be understood as designating a common symbol that can be interposed, in the case of text-to-speech, between text to be spoken and an acoustic unit uttered; and, in the case of speech to be recognized as text, between the unit uttered and the text representing the utterance. The physical speech can be digitized and measured in terms of pitch, harmonics, formants, amplitude, duration, and so on. Further abstractions can be derived or calculated to represent the physical speech data in different ways, for example linear prediction coefficients, Hermite splines, and so on, and can be used to estimate or represent the amount of modification or adjustment that may be necessary for the output to be accurately perceived as continuous speech, whether it be actual speech perceived by a machine or synthesized speech perceived by a human listener.

For example, there may be considered to be 54 phonemes in General American English. In contrast, methods according to the present invention can employ far greater numbers of prosodic phonetic units for a given style of prosodic speech. For example, in excess of 1,300 uniquely identifiable prosodic phonetic units can be used for synthesizing or recognizing a reportorial prosody in General American English.

A basic prosodic style in General American English is identified as “Reportorial” and is known as a style of expressive intonations, pronunciations, and speaking rhythms, such as exemplified by a National Public Radio News announcer. Reportorial General American English is referred to herein as “the base prosody”. If desired, suitable numbers of various spoken corpuses and prosodies available in a field, or a language, can be examined to arrive at a suitable number of prosodic phonetic units to provide in a database for practicing one or more of the methods of the invention in the respective field or language. Prosodic phonetic units, as described herein, can be employed in the inventive methods disclosed herein. Alternatively, or in addition, each of these methods can be practiced with the use of standard phonemes, if desired.

Any suitable number of prosodic phonetic units can be employed in the practice of one or more of the methods of the invention, as will be understood by a person of ordinary skill in the art in light of this disclosure. For example, a useful database employed in the practice of the invention can be populated using about 600 uniquely identifiable prosodic phonetic units with the average prosodic phonetic unit having about 225 examples of its use, giving a total number of units in the database of approximately 135,000. These numbers are of course merely illustrative and can be varied widely depending upon the application and the available examples. Thus, some uniquely identifiable prosodic phonetic units may contain or be associated with 1,000 or more examples of their use, and others may contain or be associated with fewer than 50, for example as few as five, ten or twenty. The number of prosodic phonetic units can be in a range of from about 100 to about 6,000 or 7,000 or more. A number of prosodic phonetic units in a range of from about 300 to about 2,000 can be useful for some purposes. The total number of units in the database can also vary and can, for example, be in a range of from about 2,000, more desirably 20,000, to about 200,000 or 300,000 or even higher. Thus, a comprehensive application addressing a number of different prosodic styles could have as many as 1 or 2 million or more acoustic units, perhaps up to about 15 million. Such a database could employ about 15,000 prosodic phonetic units each having about 1,000 exemplary acoustic units, on average; which is merely an example of a larger database. The particular numbers can of course be varied, as will be, or become, apparent to a person of ordinary skill in the art.

Other suitable combinations of prosodic phonetic units and exemplary acoustic units may be used to arrive at an appropriately sized database. Some useful determinants of suitability can include such items as the expressive range in voiced pitch levels, articulatory rhythms applied to words and phrases, the particular number and styles of prosodic speech, grammar and vocabulary, and the amount of modification applied to a prosodic acoustic unit.

Even when using an extensive database of prosodic phonetic units there may still be variations in the acoustic parameters of speech. Pursuant to the invention, acoustical parameter variations can be managed by grouping acoustic units and comparing them according to their location in a linguistic hierarchy. Furthermore, some aspects of the invention include methods of accounting for acoustic context in sequence, such as: pitch level and location within an intonation profile; durations of voiced and unvoiced acoustic units, as well as of silences; and variations in loudness, stress, and/or prominence. Methods according to the invention can enable prediction of specific optimal acoustical parameters desired for a specific relevant context; this prediction can enable selection of a useful or optimal acoustic unit from those units available to be used in expressive speech synthesis. The acoustic unit prediction and selection can employ prosodic phonetic units annotated according to prosodic text parsing and having linguistic labels. The method of the invention can also allow the selected unit to be modified to more closely match specific predicted optimal acoustic parameters. Embodiments of methods according to the invention can provide prosodic speech synthesis that reduces the listener's cognitive work-load and enhances her or his perception and comprehension of the text being synthesized.

The recognition of prosodic phonetic units according to patterns of acoustic parameters relating to commonly used speech prosodies can improve speech recognition by correctly identifying more of the text, as uttered, directly from the acoustic data, and can also provide automated annotations to the text that indicate the emotional state of the speaker.

To create databases useful in the practice of the invention, trained speakers can be initially employed as voice models to provide a limited number of examples of each of various voice types, for example male, female, young, old, fundamental voice frequencies for speakers with low-, middle- and high-pitched voices, and so on, speaking each of various prosodies. Samplings of text in a wide range of text genres can be employed and predictions can be made of the maximum number of prosodic phonetic units in actual use along with their indicated frequencies of use and according to the various genres. For example, a database for a single voice speaking a text corpus representative of General American English according to rules for a basic prosody can comprise from 3 to 10 hours of recorded speech. Each additional prosody spoken may require fewer hours of recorded speech as subsequent prosodies will likely include numerous pronunciations in common with prior prosodies which do not require duplicating.

By employing sufficient numbers of speaking voices, and by using machine learning techniques, if desired, the method can be generalized to: use a few hours of voice recordings of ordinary persons and then synthesize prosodic speech from input text with a synthesized voice having the sound of the ordinary person's voice; and, with special effects for modifying voices, to craft distinctive voices for new fictional characters.

The analysis of linguistically dependent acoustic units from prosodically pronounced and recorded speech can be used to predict acoustic wave metrics for concatenation. In the event that a candidate acoustic unit does not completely fit an ideal, acoustic parametric manipulations for one or more close candidates can be specified and evaluated to create a useable candidate that can be either identical to, or perceptually close enough to, the ideal candidate to be used in expressive synthesis by an identifiable voice and in accordance with a specific speaking prosody.

Methods for Analysis of Prosodic Speech

Concatenated speech synthesis often uses segments of recorded speech that are broken into small utterances. Misalignment of the basic pitch level, F0, as well as misalignment of the formants F1, F2, etc. between two short segments to be concatenated often yields perceivable “glitches” in the sound of the synthesized speech.

Known methods of providing text-to-speech synthesis software and acoustic unit databases often seek to reduce variations in pitch by calling for a voice talent, who reads the text to provide the recorded speech, to speak with reduced intonation patterns (i.e. to reduce the number and range of pitch changes). Also, post-processing techniques may be employed to smooth the places where adjacent acoustic units are joined. Joins are sometimes accomplished by working with the wave data in pitch synchronous form and performing some form of “overlap and add” process to smooth the join points. The result may be perceived as slightly “muddied” speech but nevertheless smoothed and not yielding complaints about being distracting or incomprehensible. Similarly, durations of sounds and loudness are also addressed by post-processing procedures being applied to the wave form units as concatenated.

Such methods may be unable to yield prosodic speech with meaningful changes in pitches, amplitudes, and durations. The inventive method re-examined the requirements for changing pitch, amplitude, and duration, all instantaneously, by undertaking a general method for entering the process of wave data manipulation via the phase state as the common element for undertaking simultaneous changes prior to concatenation. One such method is further described below.

Analyzing Prosodic Speech Utterances

A real valued signal x(t) (as a function of time t) can be extended to an analytical signal in the complex domain by computing the Hilbert transform shown below as equation 1:

$\begin{matrix}{{y(t)} = {{\mathcal{H}( {x(t)} )} = {\frac{1}{\pi}{\int_{- \infty}^{\infty}{\frac{x(\tau)}{t - \tau}{\tau}}}}}} & ( {{Eq}.\mspace{14mu} 1} )\end{matrix}$

See, for example, Online Encyclopedia of Mathematics, at eom.springer.de.

In practical digital signal processing applications an approximation of the Hilbert transform of a signal can be obtained by a finite impulse response convolution filter using an impulse response as shown in FIG. 1. The impulse response shown in FIG. 1 is a digital approximation of the output of a Hilbert transform filter using a length of 121 samples.
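By way of illustration only, a 121-sample finite impulse response approximation of this kind might be sketched as follows. The ideal Hilbert transformer response 2/(πn) for odd n is standard; the Hamming window, the function names `hilbert_fir` and `hilbert_transform`, and the 16 kHz test tone are assumptions made for this sketch and are not taken from FIG. 1.

```python
import numpy as np

def hilbert_fir(length=121):
    """Windowed ideal Hilbert transformer with an odd number of taps."""
    if length % 2 == 0:
        raise ValueError("length must be odd")
    half = length // 2
    n = np.arange(-half, half + 1)
    h = np.zeros(length)
    odd = (n % 2) != 0
    # Ideal impulse response: 2/(pi*n) for odd n, zero for even n.
    h[odd] = 2.0 / (np.pi * n[odd])
    # The Hamming window is an assumption; other windows could be used.
    return h * np.hamming(length)

def hilbert_transform(x, length=121):
    """Approximate y = H(x) by FIR convolution."""
    # mode="same" keeps the output aligned with the input, which
    # compensates the (length - 1) / 2 sample delay of the filter.
    return np.convolve(x, hilbert_fir(length), mode="same")

# Hypothetical test signal: a 1 kHz cosine sampled at 16 kHz.
fs = 16000.0
t = np.arange(4000) / fs
x = np.cos(2 * np.pi * 1000.0 * t)
y = hilbert_transform(x)  # close to sin(2*pi*1000*t) away from the edges
```

A filter of this length has reduced gain at very low frequencies, so in practice the lowest frequency of interest would influence the choice of length and window.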

Using a complex valued combination of the original signal as the real part and its Hilbert transform as the imaginary part, the analytical signal z(t)=x(t)+i y(t) is formed. This signal z(t) can be illustrated as a path in the complex plane, as is exemplified graphically in FIG. 4. If the analytical signal is understood as a moving point in the plane described by its distance from the origin and the angle relative to the real axis, one obtains a polar form of the signal with the formal representation:

$z(t) = |z(t)|\,e^{i\Theta(t)} \qquad (\text{Eq. 2})$

In this representation, the complex value z(t) is understood as a momentary point in the complex domain at time t. Then, the amplitude |z(t)| corresponds to the distance from the origin and the phase Θ(t) to the angle relative to the real axis. The amplitude and phase can be obtained by the following computations (Eq. 3):

$\begin{matrix}{{A(t)} = {{{z(t)}} = {{\sqrt{{x^{2}(t)} + {y^{2}(t)}}\mspace{14mu} {and}\mspace{14mu} {\Theta (t)}} = {\arctan \frac{y(t)}{x(t)}}}}} & ( {{Eq}.\mspace{14mu} 3} )\end{matrix}$

Thus, the instantaneous phase Θ(t) can be obtained by the inverse of the trigonometric tangent function, which transforms the ratio of the imaginary and real parts of z(t) into an angle. An example of the instantaneous amplitude of the signal |z(t)| is shown in FIG. 2A as the envelope of the signal, and an example of the phase function Θ(t) is shown in FIG. 2B. In FIG. 2A, amplitude on an ordinate scale of from 2 to −1.5 is plotted against time on an abscissa scale of from 0 to 0.35 seconds. Referring to FIG. 2A, the signal and envelope |z(t)| for the low pass filtered residual signal shown can be obtained by inverse filtering, integrating over time, and applying the Hilbert transform. The signal is from the word “way” with dropping pitch, followed by an aspirated sound for which the autoregressive model may no longer hold true and a larger error signal may appear.

In FIG. 2B the corresponding instantaneous phase Θ(t) can be seen. In FIG. 2B, phase in radians on an ordinate scale of from π to −π is plotted against time on an abscissa scale of from 0 to 0.35 seconds.
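As a minimal sketch of the computations in Eq. 3, the amplitude and wrapped phase can be obtained from the real part x and the imaginary part y as shown below. The function name `polar_form` and the synthetic test signal are assumptions for illustration; the four-quadrant `arctan2` is used in place of a plain arctangent of y/x so that the result covers the full interval from −π to π.

```python
import numpy as np

def polar_form(x, y):
    """Return instantaneous amplitude A and wrapped phase Theta (Eq. 3)."""
    amplitude = np.hypot(x, y)   # sqrt(x^2 + y^2)
    phase = np.arctan2(y, x)     # four-quadrant arctangent, in [-pi, pi]
    return amplitude, phase

# Hypothetical analytic signal: a 150 Hz carrier with a slow envelope.
t = np.linspace(0.0, 0.05, 801)
envelope = 1.0 + 0.5 * np.cos(2 * np.pi * 20.0 * t)
x = envelope * np.cos(2 * np.pi * 150.0 * t)
y = envelope * np.sin(2 * np.pi * 150.0 * t)  # stands in for the Hilbert transform of x
amplitude, phase = polar_form(x, y)
```

Here the recovered amplitude follows the envelope, and the wrapped phase shows the zigzag pattern of the kind plotted in FIG. 2B.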

Phase Unwrapping

The phase Θ(t) can be obtained numerically only within a constant offset of a multiple of 2π, because the inverse tangent function delivers only values in the interval [−π, π], resulting in a zigzag curve as shown in FIG. 2B. An unwrapped phase can be implicitly defined as a function which, taken modulo 2π, gives the observed phase function:

$\Theta(t) = \Theta_{u}(t) \bmod 2\pi \qquad (\text{Eq. 4})$

Unwrapping thus consists of inverting this relationship, which cannot be done analytically, starting from a digital signal representation of Θ(t). If the phase changes rapidly, phase unwrapping becomes problematic, because for larger jumps it becomes ambiguous whether 2π or −2π should be added. However, by exploiting the properties of the analytical signal, a more reliable method can be found. In the digital domain, for two consecutive values of the signal (at sample times n and n+1) with the sample values x[n] and x[n+1], the corresponding samples of the analytical function are z[n]=x[n]+i y[n] and z[n+1]=x[n+1]+i y[n+1]. If the phase Θ[n] is known, the phase difference can be obtained by a complex division, which in polar coordinates can be formally written as:

$\begin{matrix}{{{z\lbrack {n + 1} \rbrack}/{z\lbrack n\rbrack}} = {\frac{{z\lbrack {n + 1} \rbrack}}{{z\lbrack n\rbrack}}^{{({{\Theta {\lbrack{n + 1}\rbrack}} - {\Theta {\lbrack n\rbrack}}})}}}} & ( {{Eq}.\mspace{14mu} 5} )\end{matrix}$

The phase difference occurs in the exponent as Θ[n+1]−Θ[n], and it represents the increment in phase between time n and n+1. As long as the underlying signal consists only of frequencies sufficiently below the Nyquist frequency (½ of the sampling frequency), which can be the case for the signals here considered, the relative phase increments should be smaller in magnitude than π/2. The phase values can then be obtained simply by adding up the phase increments. Formally, a recursive scheme of computation for the phase at time n+1 as shown in (Eq. 6) can be obtained, which follows from the algebraic representation of the complex division.

$\begin{matrix}{{\Theta \lbrack {n + 1} \rbrack} = {{\Theta \lbrack n\rbrack} + {\arctan ( \frac{{{x\lbrack n\rbrack}{y\lbrack {n + 1} \rbrack}} - {{x\lbrack {n + 1} \rbrack}{y\lbrack n\rbrack}}}{{{x\lbrack {n + 1} \rbrack}{x\lbrack n\rbrack}} + {{y\lbrack {n + 1} \rbrack}{y\lbrack n\rbrack}}} )}}} & ( {{Eq}.\mspace{14mu} 6} )\end{matrix}$

A meaningful starting value of the phase is Θ[0]=0 for a time point that corresponds to a peak in the amplitude of the analytic signal. From there on, the phase of all subsequent samples x[n] can be computed recursively based on the analytic signal. This method can be reliable if the signal is sufficiently smooth, which can usually be achieved by using a sufficiently band limited signal.
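The recursion of Eq. 6 can be sketched in a few lines, assuming the real part x and the Hilbert transform y are already available as arrays. The function name `unroll_phase` and the chirp used to exercise it are hypothetical; `arctan2` is used for the increment, which coincides with the arctangent of Eq. 6 whenever the increment is smaller in magnitude than π/2.

```python
import numpy as np

def unroll_phase(x, y, theta0=0.0):
    """Accumulate per-sample phase increments as in Eq. 6."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Imaginary and real parts of z[n+1] * conj(z[n]).
    num = x[:-1] * y[1:] - x[1:] * y[:-1]
    den = x[1:] * x[:-1] + y[1:] * y[:-1]
    increments = np.arctan2(num, den)
    # Theta[0] = theta0, then a running sum of the increments.
    return theta0 + np.concatenate(([0.0], np.cumsum(increments)))

# Hypothetical smooth signal whose phase increment grows from 0.2 to about 0.4 rad.
n = np.arange(2000)
true_phase = 0.2 * n + 0.00005 * n**2
x = np.cos(true_phase)
y = np.sin(true_phase)
unrolled = unroll_phase(x, y)
```

Because only differences between neighboring samples are used, no decision about adding or subtracting 2π is needed as long as the signal is band limited as described.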

An example of a useful method employs a low pass filtered version of the residual signal obtained from the Burg lattice filter to provide small phase increments. The input to the Burg lattice filter is a speech signal which is filtered by a pre-emphasis filter, which emphasizes higher frequencies to increase the accuracy of the partial correlation coefficients representation, described elsewhere herein. The residual signal is further processed by a leaky integrator to reverse the pre-emphasis. For the purpose of estimating an unrolled phase signal a low pass filtered version of this signal can be used. In the example illustrated, the signal can be obtained by applying a Butterworth low pass filter with a cutoff frequency of 1200 Hz. FIG. 3 shows the unrolled phase of the signal; the slope of this curve is 2π times the instantaneous fundamental frequency, which is to say, 2πF0.
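Since the slope of the unrolled phase is 2πF0, an estimate of the instantaneous fundamental frequency follows by differentiating the phase and dividing by 2π. The sketch below assumes a sampled unrolled phase and a known sampling rate; the function name `instantaneous_f0` and the falling-pitch example are illustrative assumptions.

```python
import numpy as np

def instantaneous_f0(unrolled_phase, fs):
    """Estimate F0 in Hz as the phase slope divided by 2*pi."""
    # np.gradient gives the slope in radians per sample; fs converts to per second.
    return np.gradient(unrolled_phase) * fs / (2.0 * np.pi)

# Hypothetical falling pitch: F0(t) = 200 - 100 t Hz over 0.35 seconds.
fs = 16000.0
t = np.arange(int(0.35 * fs)) / fs
phase = 2.0 * np.pi * (200.0 * t - 50.0 * t**2)
f0 = instantaneous_f0(phase, fs)
```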

The unrolled phase can be used directly to obtain pitch pulse sequences. For the beginning of a voiced signal portion a point in time can be searched where the residual signal goes through a local energy maximum. The value of the unrolled phase at that point can be subtracted from the remaining unrolled phase signal, and pitch marks can be obtained as the time stamps where this signal traverses a multiple of 2π. Based on this analysis, the analytic signal analysis of the glottal waveform can be used for identifying glottalization and period doubling or halving in rapid transitions (for example for rapidly lowering the fundamental frequency). Period doubling or halving may be associated with the appearance of a second order irregularity in the signal, so that the glottal signal has an alternating short and a long pitch pulse interval as illustrated in FIGS. 4 and 5.
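The pitch marking step described above might be sketched as follows. The sketch assumes an increasing unrolled phase with increments below 2π per sample, takes the index of the local energy maximum as a given parameter (`start_index`, a hypothetical name), and places each mark by linear interpolation between the two samples that bracket a multiple of 2π; the interpolation is an assumption of this sketch.

```python
import numpy as np

def pitch_marks(unrolled_phase, fs, start_index=0):
    """Return time stamps (seconds) where the phase crosses multiples of 2*pi."""
    rel = np.asarray(unrolled_phase, dtype=float)[start_index:]
    rel = rel - rel[0]                      # subtract the phase at the reference point
    cycles = np.floor(rel / (2.0 * np.pi))  # completed cycles at each sample
    marks = []
    for i in np.nonzero(np.diff(cycles) > 0)[0]:
        target = 2.0 * np.pi * cycles[i + 1]
        # Linear interpolation between samples i and i + 1.
        frac = (target - rel[i]) / (rel[i + 1] - rel[i])
        marks.append((start_index + i + frac) / fs)
    return np.array(marks)

# Hypothetical steady voiced segment at 110 Hz, 50 ms long.
fs = 16000.0
t = np.arange(800) / fs
phase = 2.0 * np.pi * 110.0 * t
marks = pitch_marks(phase, fs)
```

Alternating short and long intervals between successive marks would then be one way to look for the period doubling discussed above.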

In FIG. 4, the analytic signal of the vowel transition for the letter “a” demonstrates period doubling. The two centers of convolution visible center right of the figure suggest two fundamental frequencies and that phase is shifting with respect to time.

Analysis of doubling of the period duration can be effected as illustrated in FIG. 5, beginning as shown in the topmost panel by applying pitch markers to the speech signal, using for example Praat pitch marker software available from www.praat.org. In FIG. 5, each panel shows the change of signal phase from π to −π with respect to time. Descending through the six panels, one can see the removal of tongue and mouth effects from the speech signal, leaving the glottal effects of air on the vocal cords. The bottom panel shows the phase change to be regular throughout much of the glottal signal.

Methods for Predicting Acoustic Unit Paths to be Synthesized as Continuous Prosodic Speech

In some aspects of the inventive method, each unit in the acoustic database can be identified using its prosodic phonetic unit name, its linguistic feature classification according to the text where the acoustic unit is located, and its directly measured acoustic features, computational representations or extensions thereof. In each case, the name, linguistic features, or acoustic features can be considered to be properties of the unit. These methods of identification and classification are:

(1) By prosodic phonetic unit name and other linguistic features directly derivable from the prosodic phonetic unit name. (Prosodic phonetic units having the same name also have the same linguistic features. For example, a prosodic phonetic unit may be an unvoiced plosive, or it may be a front vowel with a pitch level of 3.);

(2) By sub- and supra-segmental linguistic features such as rising and falling pitch levels, co-articulations, rhythms, and stresses that derive from the position of a particular prosodic phonetic unit in the sequence of preceding and following prosodic phonetic units in the word, phrase, sentence, or paragraph to be pronounced or synthesized according to prosodic patterns or rules. These sub- and supra-segmental features can be determined by parsing the linguistic tree of the word, phrase, sentence, or paragraph for each prosodic phonetic unit, thereby identifying the specific linguistic context for the prosodic phonetic unit and adding the information about the feature to the identity of the acoustic unit corresponding to the prosodic phonetic unit. (Linguistic features such as phrase final, coda, prosodic phonetic unit “p” (prior) is a fricative, prosodic phonetic unit “n” (next) is a sonorant, etc., are examples of this type of linguistic context identification feature applying to the prosodic phonetic unit.); and

(3) By acoustic related features that are measured acoustic parameter wave data directly related to a specific unit in the acoustic database (e.g., pitch F0, MFCCs, energy, duration, etc.) and corresponding to the specifically named prosodic phonetic unit and its identified linguistic context.

To synthesize a specific text, the text can be run through an automated prosody labeling engine. An example of a suitable prosody labeling engine is disclosed in patent application Ser. No. 12/188,763. Once the text is labeled with prosodic phonetic units, the unit selection process for synthesis can begin. In the unit selection process for synthesis each prosodic phonetic unit can be associated with its specific bundle of feature values, both linguistically as text and acoustically as wave data. Features can be predicted by the front end parsing engine's analysis of the linguistic parsing tree according to prosodic patterns or rules. The combined set of linguistic and acoustic features desired for a specific prosodic phonetic unit in a specific location in a sequence of prosodic phonetic units is referenced herein as “the target feature bundle”.

For every specific prosodic phonetic unit to be synthesized this target feature bundle can be compared against the linguistic and acoustic features of possible candidate units contained in the acoustic database. The acoustic database contains a unique identifier for each candidate unit, along with categorical values for its associated linguistic and acoustic features. From the acoustic database the more important of these features can be used as pre-selection criteria, because there may be relatively few units in the database that will exactly meet all of the criteria expressed in the target feature bundle.

This database query can return a large number, Nc, of candidate units; for example, the upper limit of Nc can be set to a number such as 500, if desired, or to another suitable value. The target cost of each of the candidates can be computed based on a linguistic feature hierarchy. Alternatively, if the system is not constrained to a limited number of likely candidates, a specific measure of “badness” can be calculated for every candidate unit in the database. Units with close to ideal linguistic features can have target costs of close to zero, and units with no matching linguistic features in terms of sound and linguistic context can have high target costs.

As an alternative to working with a single unified prosodic phonetic space, methods and system embodiments of the invention can subdivide a population of prosodic phonetic units into multiple classes each of which has one or more feature hierarchies. In one example, 11 different classes with 11 different feature hierarchies can be used. Other suitable numbers of classes and hierarchies can also be employed, as will be, or become, apparent to a person of ordinary skill in the art, for example, from 2 to about 50 classes with each class having from 2 to about 50 different feature hierarchies. If desired, from 5 to about 20 classes can be employed with each class having from about 5 to about 20 different feature hierarchies.

In a relatively robust (non-sparse) database, the additional specificity provided by multiple hierarchies (essentially strictly separating candidate selection spaces) has no impact on the units selected. In one example employing a population of 600 prosodic phonetic units, there could be as many as 360,000 plus different hierarchies, i.e. a different hierarchy for every possible prosodic phonetic unit adjacency (600 prosodic phonetic units × 600 prosodic phonetic units). It can be expected that additional hierarchies for more segmented classes (or additional hierarchy depth and specificity in the case of a single unified hierarchy) will improve synthesis quality, but the number of hierarchies in any real system will be relatively small. The feature hierarchy determines the weights associated with each feature mismatch. If a feature is high in the hierarchy, a candidate unit can be penalized with a high cost when the feature for that specific candidate unit is different from the desired target feature, while feature mismatches lower in the hierarchy may be penalized to a lesser extent. This step of the evaluation yields a list of unit candidates (up to Nc in total) which can be rank ordered by sorting according to increasing target cost. Thus, the best matches representing the lowest target cost come first, and the worse matches are near the end of the list and have high target costs (or penalties).

A person of ordinary skill in the art will understand that many methods can be used to derive a suitable target cost from the categorical linguistic features. One such method will now be described by way of example and without limitation.

Establish multiple categorical hierarchies such as the example shown below for Coda Sonorants:

1.1. Coda Sonorants

-   1.1.1. Non-phrase-final, next sound is NOT a sonorant
    -   1. Current_Syllable_Stress (value of 0-2 based on three states: stressed, unstressed, NA)
           Current_Syllable_Pitchlevel (value of 0-3, or 6, based on four states: 1, 2, 3, NA)
           Current_Syllable_Inflectiontype (value of 0-3 based on seven states: none, upglide, downglide, up circumflex, down circumflex, level sustained, NA)
    -   2. Previous_PPU*_Articulationplaceandmanner (value of 0-4, or 16, based on system of 17 possible states of articulation place and manner)
           Previous_PPU_Vowelheight (value of 0-3 based on four states: high/mid/low/NA)
           Previous_PPU_Vowelfrontness (value of 0-3 based on four states: front/central/back/NA)
    -   3. Next_PPU_Articulationplaceandmanner (value of 0-4, or 16, based on system of 17 possible states of articulation place and manner)
    -   4. Segment_Cluster (value of 0-2 based on three states: part of consonant cluster, not part of consonant cluster, NA)
    -   5. Next_Syllable_Stress (value of 0-2 based on three states: stressed, unstressed, NA)
           Next_Syllable_Pitchlevel (value of 0-3 based on four states: 1, 2, 3, NA)
           Next_Syllable_Inflectiontype (value of 0-3, or 6, based on seven states: none, upglide, downglide, up circumflex, down circumflex, level sustained, NA)
    -   6. Previous_PPU_Name (value of 0-1 based on 2 states: match, no match)
    -   7. Next_PPU_Name (value of 0-1 based on 2 states: match, no match)

*PPU: Prosodic Phonetic Unit

This hierarchy only deals with coda sonorant sounds that are non-phrase final, where the next following sound is not a sonorant. There are seven levels in the hierarchy, with the linguistic features in level 1 always more important than the linguistic features in level 2 and below. In other words, a candidate where the current syllable stress matches the desired syllable stress (a level 1 feature), and nothing else matches in level 2 and below, is preferred to another candidate where nothing matches in level 1, and everything matches for all features at level 2 and below.

As can be seen from the outline above, this hierarchy does not need to be strict, and can have multiple linguistic features on the same level of the hierarchy. To determine a target cost, each possible candidate can be compared against the desired optimal values for the prosodic phonetic unit to be synthesized. The ideal candidate has a linguistic target cost of zero. In the example provided in Table 1, below, the highest level of the hierarchy is level 1, and the lowest level of the hierarchy is level 7. This table represents one useful way of practicing a target cost penalty method according to the invention. The penalty added to the target cost for each mismatch at a specific level is one penalty value higher than the maximum penalty possible if there is complete mismatch at all lower levels in the hierarchy. This method of assigning target penalty values ensures that the categorical feature preferences represented in the hierarchy are fully preserved in the target cost ranking. In other words, for this example hierarchy, the resulting penalty target values rank order the candidate units in terms of linguistic feature preference from zero (all linguistic features match) to potentially 1476 (no linguistic features match).

TABLE 1 Exemplary penalty cost structure to convert hierarchical linguistic categories into target costs

| Hierarchical Level | Number of Linguistic Features on Level | Total Number of States on Level | Cost of Single Feature Mismatch at Specified Level | Cost of Maximum Mismatch at Specified Level | Cost of All Features Mismatching at Specified Level and Below |
| 1 | 3 | 11 | 368 | 1109 | 1476 |
| 2 | 3 | 14 | 90 | 278 | 367 |
| 3 | 1 | 6 | 43 | 47 | 89 |
| 4 | 1 | 3 | 21 | 22 | 42 |
| 5 | 3 | 11 | 4 | 17 | 20 |
| 6 | 1 | 2 | 2 | 2 | 3 |
| 7 | 1 | 2 | 1 | 1 | 1 |

The approach outlined above, and similar approaches that can be implemented through tiered confusion matrices, and using other methods known to those skilled in the art, ensure that all of the linguistic preference information captured in the hierarchy can be retained in the single target cost number.
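The “one higher than everything below” rule can be checked against the figures of Table 1 with a short sketch. The per-level maximum mismatch costs are copied from the table; the function name `single_mismatch_costs` and the dictionary layout are assumptions made for illustration.

```python
# Maximum mismatch cost at each hierarchy level, copied from Table 1.
MAX_MISMATCH_AT_LEVEL = {1: 1109, 2: 278, 3: 47, 4: 22, 5: 17, 6: 2, 7: 1}

def single_mismatch_costs(max_at_level):
    """Derive the single-mismatch cost and cumulative cost for every level."""
    single, cumulative = {}, {}
    below = 0  # total cost if everything at the lower levels mismatches
    for level in sorted(max_at_level, reverse=True):  # start at the lowest level (7)
        single[level] = below + 1      # one more than the worst case below
        below += max_at_level[level]
        cumulative[level] = below      # worst case at this level and below
    return single, cumulative

single, cumulative = single_mismatch_costs(MAX_MISMATCH_AT_LEVEL)

# One level 1 mismatch versus a complete mismatch at levels 2 through 7.
one_level1_mismatch = single[1]
all_lower_mismatch = cumulative[2]
```

With these values a single level 1 mismatch always costs more than a complete mismatch at level 2 and below, so sorting candidates by the summed cost preserves the ordering expressed by the hierarchy.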

The example above is not a strict hierarchy in that linguistic features that are on the same level can be traded against each other. For instance, on level 2 of the example, previous prosodic phonetic unit vowel height and previous prosodic phonetic unit vowel frontness are equally important, so two candidate units with identical linguistic features except for previous prosodic phonetic unit vowel height and frontness would have identical target costs if one candidate unit matched on previous prosodic phonetic unit vowel height and mismatched on previous prosodic phonetic unit vowel frontness, and the other candidate unit matched on previous prosodic phonetic unit vowel frontness and mismatched on previous prosodic phonetic unit vowel height. Uneven trading ratios are also possible; linguistic features on the same level do not have to have the same penalty value for a mismatch.

The acoustic database can be extended by including certain measurable acoustic features that are associated with each specific prosodic phonetic unit candidate. These measured acoustic features can be understood as quantitative representations of perceptual properties for a specific prosodic phonetic unit candidate. Since acoustic features typically change over time, the various acoustic features for prosodic phonetic unit candidates to be sequenced as synthesized utterances may also be seen as parameters that describe quantified path segments in the acoustic feature space. For each acoustic feature, one may develop a mathematical model by calculating a smoothed polynomial or other representation, using one of the various curve fitting methods (e.g. initial and final values as well as rates of change vectors for Hermite splines, or ten point samples over the duration of the prosodic phonetic unit candidate). Modeling may be implemented in many ways. One way, in order to speed computation during speech synthesis, would be to calculate curve coefficients for each prosodic phonetic unit candidate ahead of time and store them as appended data in the database, so as to be simultaneously retrieved with the linguistic feature data and other acoustic metrics for the wave data of the specific prosodic phonetic unit candidate.
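As one hedged illustration of such a curve model, a single acoustic feature path over the duration of a unit can be described by a cubic Hermite segment defined by its initial and final values and its initial and final rates of change. The function name `hermite_segment` and the example F0 values below are hypothetical and are not taken from the database described herein.

```python
import numpy as np

def hermite_segment(p0, p1, m0, m1, duration, t):
    """Evaluate a cubic Hermite segment at times t in [0, duration].

    p0, p1: feature values at the start and end of the unit.
    m0, m1: rates of change (per second) at the start and end.
    """
    s = np.asarray(t, dtype=float) / duration  # normalized time in [0, 1]
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * p0 + h10 * duration * m0 + h01 * p1 + h11 * duration * m1

# Hypothetical F0 path: 180 Hz falling to 140 Hz over a 120 ms unit,
# sampled at ten points over the duration of the unit.
t_samples = np.linspace(0.0, 0.12, 10)
f0_curve = hermite_segment(180.0, 140.0, -100.0, -500.0, 0.12, t_samples)
```

Storing only the four numbers p0, p1, m0 and m1 per feature, together with the duration, would be one compact way of appending such curve data to a database record.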

The correspondence between linguistic features of prosodic phonetic units, which are categorical, and acoustic features of uttered speech, which are quantitative and measurable, may only be exactly known for the acoustic units actually in the acoustic database of recorded sounds. Pursuant to a method aspect of the invention each acoustic unit waveform in the database can be labeled with a prosodic phonetic unit name plus linguistic features relevant to the acoustic unit. Similarly, the derived acoustic feature vectors comprising F0, spectral coefficients as represented by MFCCs or other methods, duration, energy, and other acoustic parameters, can also be measured and stored in the database, or they can be computed ad hoc from the wave data as needed.

In prosodic synthesis there are also occurrences where one enters plain text to be synthesized as speech but none of the acoustic unit candidates directly corresponds to the ideal linguistic features for a particular prosodic phonetic unit, or a sequence of prosodic phonetic units. In such circumstances various modifications of the acoustic wave signal for an individual prosodic phonetic unit, or sequence of prosodic phonetic units, can be undertaken so that the perception of the synthesized utterance closely approximates smooth and continuous speech. Often there is a general correspondence between linguistic features and acoustic feature vectors and this relationship can be quantified or unquantified.

For example, if one were to synthesize the initial short “a” in “away”, as contained in the sentence “We were away a week”, one would begin searching among all short “a” candidates in the population of candidate sounds, or acoustic units, stored in an acoustic database. All forms of short “a” have “N4” as the initial orthographic item in the extended prosodic phonetic unit name that is used to identify the corresponding acoustic unit in the acoustic database. FIGS. 6-11 show the results of various approaches to identifying acoustic unit candidates for synthesizing prosodic human speech in a population of acoustic units.

As can be seen from FIG. 6, such a prosodic phonetic unit description is too general and therefore insufficient to predict the desired acoustic features for concatenation with much accuracy.

FIG. 6 shows the distribution of a base prosodic phonetic unit, labeled as ‘N4’ for a short ‘a’, across all three pitch levels (low L1, medium L2 and high L3, employed as an approximate measure of pitch), both tonalities (low iL and high iH, employed as an approximate measure of intensity), and duration. The result is that there is insufficient information to predict suitable prosodic phonetic unit pitch and duration for representing or matching with the acoustic unit to be synthesized, the short “a”.

One can further classify and distribute the acoustic units according to their more specific prosodic phonetic unit names. FIG. 7 shows the distribution of all base ‘N4’ prosodic phonetic units further classified by their level of pitch (F0) according to levels 1, 2, and 3, using different symbols with different grayscale values to indicate prosodic phonetic units having the respective pitch level, as shown in the key in the upper right hand corner of the figure. Similar keys are shown in each of FIGS. 8-11. Since pitch is a perceptual measure, and is dependent upon linguistic context, there is substantial overlap among the L1, L2 and L3 labels for the prosodic phonetic unit N4. Again, there is insufficient information to predict prosodic phonetic unit pitch and duration using pitch level.

One can also reclassify and distribute the base prosodic phonetic unit N4 according to low or high tonality, as shown in FIG. 8. An examination of the distribution of prosodic phonetic unit N4 by tonality (iL, iH), as shown in FIG. 8, indicates substantial overlap, and again the information is not sufficient to make an accurate prediction of the desired pitch and duration.

Such overlap problems might theoretically be addressed by various machine reclassification techniques. However, as shown in FIGS. 9 and 10, one of these techniques, reclassification with a K-means classification algorithm, also fails to increase the amount of prediction accuracy that can be derived from the acoustic database.

Acoustic Unit Candidate Identification and Selection Using Linguistic Feature Sets

Pursuant to the methods of the invention, a combined linguistic feature set for a specific prosodic phonetic unit can be used in creating a statistical model for identifying candidates in terms of acoustic unit feature values, for example fundamental frequency (F0), mel frequency cepstral coefficients (MFCCs), duration, energy, etc. The model can, in turn, use the weighted mean values of the identified candidates to predict acoustic feature values for an optimal candidate. This model may be created separately from, or concurrently with, the synthesis process. Such a method of using linguistic parameters automatically derived from plain text to be synthesized as speech can result in a more accurate or precise prediction, as is shown in FIG. 11.

Like FIGS. 6-10, FIG. 11 portrays the distribution of all short ‘a’ vowels in the acoustic database. The base prosodic phonetic unit name beginning with ‘N4’ is substantially equivalent to a short ‘a.’ Taken together, the data in FIG. 11 are similar to what is shown in FIG. 6.

However, the portion shown as black dots in FIG. 11 represents all short ‘a’ vowels in the acoustic database that do not have the desired L1 pitch level and iL Tonality; this black mass extends under the area of dark gray asterisks. The dark gray area represents base N4 prosodic phonetic units having the desired L1 pitch level and iL Tonality. The light gray crosses represent the dark gray candidates that appear to be the best N4 candidates based on the further application of categorical linguistic features for the specific prosodic phonetic units. The black square represents the weighted mean prediction using the best twenty candidates that are shown as light gray crosses.

This method, which, as stated, is useful for generating data such as are shown in FIG. 11, is further described below with reference to FIG. 12. However, simpler, order-based methods, or other weighting methods based on a linguistic hierarchy, can be employed to predict target acoustic unit parameters if desired. One example of a relatively simple weighting method is as follows:

-   Retrieve a large list (e.g. 500) of candidates from the database based on a linguistic hierarchy or other preference method.
-   Rank order the candidates by order of preference.
-   Select a reasonable number of the top candidates (e.g. 20).
-   Weight each candidate by its rank order.
-   Retrieve the specific acoustic parameters associated with each of the candidates (e.g. F0, MFCCs, Energy, Duration).
-   Calculate a weighted ideal target value for each acoustic parameter based on the rank-order weighted mean. (In the case of 20 candidates, this will give the top candidate a 9.5% weight, and the 20th candidate a 0.5% weight, in calculating the mean.)
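By way of a non-limiting illustration, the rank-order weighting listed above can be sketched as follows. The function names, the dictionary layout of a candidate, and the sample values are assumptions made for this sketch only; the weights are taken to be proportional to rank, which for 20 candidates yields 20/210 (about 9.5%) for the top candidate and 1/210 (about 0.5%) for the last.

```python
def rank_weights(n):
    """Weights proportional to rank: the best of n candidates gets n, the worst gets 1."""
    total = n * (n + 1) / 2
    return [(n - k) / total for k in range(n)]


def rank_weighted_target(candidates, top_n=20):
    """Rank-order weighted mean of each acoustic parameter.

    candidates: list of dicts of acoustic parameters, already ordered best first.
    """
    top = candidates[:top_n]
    weights = rank_weights(len(top))
    return {key: sum(w * c[key] for w, c in zip(weights, top)) for key in top[0]}


# Hypothetical candidate list (500 entries), already ranked by preference.
ranked = [{"f0": 120.0 + i, "duration": 0.080 + 0.001 * i} for i in range(500)]
weights = rank_weights(20)
target = rank_weighted_target(ranked, top_n=20)
```

Because the weights sum to one, the result is a convex combination of the top candidates' parameter values.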

A sufficiently large acoustic unit database with limited empty sets at the bottom of the linguistic hierarchy can also enable some of these simple weighting methods to be done ahead of time so that each candidate prosodic phonetic unit can be assigned to a specific small selection bin, with specific synthesis candidates selected based on a table lookup.

One aspect of the invention employs a speech synthesis system that can use a more complex prosodic phonetic unit-based candidate prediction to build an optimization, in which heuristics can be first used to obtain statistical parameters about good or best candidates for each prosodic phonetic unit of a sentence to be synthesized, and then an optimal trajectory of feature vectors through the acoustic feature space can be calculated. An example of such a statistical model which can be utilized for providing the results shown in FIG. 11, as will be, or become, apparent to a person of ordinary skill in the art, will now be described.

A fraction of the best scoring candidates in the candidate list can be selected and weighted sample means of the acoustic feature vectors can be calculated. A starting list of Nc candidates (typically Nc=500, or fewer if there are not enough candidates) can be rank ordered according to their target cost, which depends on the differences between the set of target features and the actual features of each candidate in the database. A subset of M of the highest ranking candidates, identified as having the lowest target costs, can be chosen to estimate statistical moments. First, the target cost can be converted into a number that has the properties of a probability measure, representing the likelihood of being the best candidate. If c is the cost of a candidate, a weight that decreases as c increases, for example a weight w=1/(1+c²), can be computed for each candidate. Other similar functions for which the calculated weight decreases as c increases can also be used to derive the appropriate weights. The weights are then normalized, each probability p_(i) being the weight w_(i) divided by the sum of all M weights, so that they sum up to 1:

$\begin{matrix}{{\sum\limits_{i = 1}^{M}p_{i}} = 1} & ( {{Eq}.\mspace{14mu} 7} )\end{matrix}$

Indicating the acoustic feature vector of the i-th candidate as f_(i) and its probability as p_(i), the expected mean $\hat{f}$ and covariance C of the acoustic features of the prosodic phonetic unit can be computed as follows:

$\hat{f} = \sum_{i=1}^{M} p_{i} f_{i} \quad \text{and} \quad C = \sum_{i=1}^{M} p_{i} (f_{i} - \hat{f})(f_{i} - \hat{f})^{T} \qquad (\text{Eq. 8})$
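As a minimal sketch of Eqs. 7 and 8, assuming NumPy is available and using the example weight w=1/(1+c²) from the text, the candidate statistics might be computed as shown below. The function name and the four sample candidates are hypothetical and serve only to make the arithmetic concrete.

```python
import numpy as np


def candidate_statistics(costs, features, m):
    """Weighted mean and covariance of the m lowest-cost candidates.

    costs: (Nc,) target costs; features: (Nc, d) acoustic feature vectors.
    """
    costs = np.asarray(costs, dtype=float)
    features = np.asarray(features, dtype=float)
    best = np.argsort(costs)[:m]          # indices of the m lowest target costs
    w = 1.0 / (1.0 + costs[best] ** 2)    # example weight from the text
    p = w / w.sum()                       # normalise so the p_i sum to 1 (Eq. 7)
    f = features[best]
    f_hat = p @ f                         # weighted mean (Eq. 8)
    dev = f - f_hat
    cov = (p[:, None] * dev).T @ dev      # sum_i p_i dev_i dev_i^T (Eq. 8)
    return p, f_hat, cov


# Hypothetical data: four candidates described by [F0 in Hz, duration in s].
costs = [0.0, 1.0, 2.0, 3.0]
features = [[100.0, 0.10], [110.0, 0.12], [90.0, 0.08], [150.0, 0.20]]
p, f_hat, cov = candidate_statistics(costs, features, m=3)
```

In this example the highest-cost candidate (150 Hz) is excluded by the choice m=3, and the remaining weights are 1, 0.5 and 0.2 before normalization.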

Furthermore, to represent correlations between prosodic phonetic unit candidates in two consecutive positions, denoted a and b, an (asymmetric) covariance between the acoustic features of the consecutive prosodic phonetic units can be calculated, as follows:

$C_{ab} = \sum_{i=1}^{M_{a}} \sum_{j=1}^{M_{b}} p_{ij} (f_{a,i} - \hat{f}_{a})(f_{b,j} - \hat{f}_{b})^{T} \qquad (\text{Eq. 9})$

Here, the probability p_(ij) is non-zero only if the pair of units (i,j) is connected, so the summation is taken only over the pairs of connected units, in which case the probability is the product of the probabilities of the individual units, i.e. p_(ij)=p_(i) p_(j). For a typical database of constrained size, all covariance matrices of higher order that may exist between prosodic phonetic units that are not directly adjacent can be ignored and assumed to be zero, since, in some cases, too few examples may be found in the database to estimate these covariance matrices.

The complete covariance matrix for all acoustic features of a sentence with L prosodic phonetic units can thus be obtained as a block tri-diagonal matrix, representing all first order covariances of direct neighbors on the off-diagonal and the intra-prosodic phonetic unit covariances on the diagonal:

$C = \begin{pmatrix} C_{11} & C_{12} & & & & \\ C_{21} & C_{22} & C_{23} & & & \\ & C_{32} & C_{33} & \ddots & & \\ & & \ddots & \ddots & & \\ & & & & C_{L-1,L-1} & C_{L-1,L} \\ & & & & C_{L,L-1} & C_{LL} \end{pmatrix} \qquad (\text{Eq. 10})$
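The assembly of Eqs. 9 and 10 can be illustrated with the following sketch, which assumes NumPy, assumes that every prosodic phonetic unit uses a feature vector of the same dimension d, and represents connectivity by a hypothetical boolean matrix whose entry (i, j) is true when candidates i and j form a connected pair. The lower off-diagonal blocks are filled with the transposes of the upper ones, which follows from Eq. 9 when the same connected pairs are used in both directions.

```python
import numpy as np


def weighted_cov(p, f):
    """Intra-unit covariance of Eq. 8 for probabilities p and features f (M, d)."""
    dev = f - p @ f
    return (p[:, None] * dev).T @ dev


def cross_covariance(p_a, f_a, p_b, f_b, connected):
    """Eq. 9: covariance between consecutive units over connected pairs only."""
    p_ij = np.outer(p_a, p_b) * connected     # p_i * p_j where connected, else 0
    dev_a = f_a - p_a @ f_a
    dev_b = f_b - p_b @ f_b
    return dev_a.T @ p_ij @ dev_b             # sum_ij p_ij dev_a,i dev_b,j^T


def block_tridiagonal(diag_blocks, off_blocks):
    """Eq. 10: diagonal blocks C_kk and upper off-diagonal blocks C_k,k+1."""
    d = diag_blocks[0].shape[0]
    n = len(diag_blocks)
    C = np.zeros((n * d, n * d))
    for k, block in enumerate(diag_blocks):
        C[k * d:(k + 1) * d, k * d:(k + 1) * d] = block
    for k, block in enumerate(off_blocks):
        C[k * d:(k + 1) * d, (k + 1) * d:(k + 2) * d] = block
        C[(k + 1) * d:(k + 2) * d, k * d:(k + 1) * d] = block.T
    return C


# Hypothetical sentence of 3 units, 4 candidates each, 2 features per candidate.
rng = np.random.default_rng(0)
p = np.full(4, 0.25)
feats = [rng.normal(size=(4, 2)) for _ in range(3)]
connected = np.eye(4, dtype=bool)             # candidate i joins only candidate i
diag = [weighted_cov(p, f) for f in feats]
off = [cross_covariance(p, feats[k], p, feats[k + 1], connected) for k in range(2)]
C = block_tridiagonal(diag, off)
```

Blocks relating units that are not direct neighbors remain zero, in keeping with the assumption stated in the text.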

The expected sequence of feature vectors is a column vector obtained by stacking the prosodic phonetic unit specific feature vector expectations together:

$\hat{f} = (\hat{f}_{1}, \hat{f}_{2}, \ldots, \hat{f}_{L})^{T} \qquad (\text{Eq. 11})$

At a join between two prosodic phonetic units, the expected feature values may be discontinuous, since they are only computed from the statistics for each candidate list separately. However, for the acoustic features, with the exception of duration and other similar acoustic parameters where the measurement may be only valid for the acoustic unit as a whole, continuity may be required when switching from one prosodic phonetic unit to the next. Therefore, a smooth optimal feature trajectory can be computed by an optimization. A useful objective, for the purposes of the present invention, is to identify a feature sequence vector which for each segment is close or closest to the segment mean but is at the same time continuous at the joins. For this purpose, the differences can be weighted with the inverse of the covariance matrix C. A solution can be obtained by solving a system of equations of the following form:

$C^{-1} f + F^{T} \lambda = C^{-1} \hat{f}$

$F f = 0 \qquad (\text{Eq. 12})$

In equation 12, the optimal feature vector is now denoted as f. It contains the optimal feature vectors for each prosodic phonetic unit in the sentence. The matrix F represents the constraints; it is a sparse matrix whose only non-zero entries are 1 and −1, and it is used to express the constraints in the system. λ represents a vector of Lagrange multipliers, one for each constraint. The solution to the above system of equations, Eq. 12, can be found by eliminating λ and solving for f. This can be formally written as:

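$f = \hat{f} - C F^{T} (F C F^{T})^{-1} F \hat{f} \qquad (\text{Eq. 13})$

Here the first equation of Eq. 12 gives $f = \hat{f} - C F^{T} \lambda$, and substituting this into the second equation gives $\lambda = (F C F^{T})^{-1} F \hat{f}$.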
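A small numerical sketch of this constrained solution, assuming NumPy, is given below. The two-unit example, with each unit described only by its starting and ending F0 and with a diagonal covariance matrix, is hypothetical and is chosen so that the result can be checked by hand.

```python
import numpy as np


def optimal_trajectory(f_hat, C, F):
    """Solve Eq. 12 by eliminating the Lagrange multipliers.

    Returns f = f_hat - C F^T (F C F^T)^{-1} F f_hat.
    """
    CFt = C @ F.T
    lam = np.linalg.solve(F @ CFt, F @ f_hat)   # Lagrange multipliers
    return f_hat - CFt @ lam


# Hypothetical example: two units, each described by [F0 at start, F0 at end] in Hz.
f_hat = np.array([100.0, 110.0, 120.0, 115.0])  # stacked segment means
C = np.diag([4.0, 4.0, 1.0, 1.0])               # unit 1 is less certain than unit 2
F = np.array([[0.0, 1.0, -1.0, 0.0]])           # end of unit 1 equals start of unit 2
f = optimal_trajectory(f_hat, C, F)
# f is [100, 118, 118, 115]: the 10 Hz gap is closed mostly by the less certain unit.
```

The unit with the larger variance absorbs the larger share of the correction (8 Hz against 2 Hz), which is the effect of weighting the differences with the inverse covariance.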

It is possible to extend this method to include other external constraints by a combination of two methods. Algebraic constraints, in which a part of the optimal feature trajectory is prescribed to go through a specified target, can be taken into account by extending the constraint matrix F and using a non-zero vector in the right hand side of the second equation of Eq. 12. Other constraints, which can be represented by the requirement to minimize additional quadratic functions, can be formalized by modifying and extending the matrix C and the right hand side of the first equation of Eq. 12.

Once the feature vectors for the optimal trajectory through the feature space are known, each candidate can be evaluated against the optimal trajectory. Target cost can now be calculated as the distance between the candidate's feature vector and that of the optimal trajectory, and join cost can be computed as the distance between a pair of adjacent candidates, to the left and right of each join, relative to each other. In all distance measures, the weights can be obtained from the inverse of the covariance matrix between the relevant variables, which is a part of the global covariance matrix. If this covariance matrix is denoted as D, the distance between two feature vectors f_(a) and f_(b) for consecutive prosodic phonetic units becomes a weighted inner product of the vector of feature differences with itself:

$\mathrm{dist}(f_{a}, f_{b}) = (f_{a} - f_{b})^{T} D^{-1} (f_{a} - f_{b}) \qquad (\text{Eq. 14})$
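The distance of Eq. 14 might be evaluated as in the following sketch, which assumes NumPy and solves a linear system rather than forming the inverse of D explicitly. The covariance values and candidate features are hypothetical.

```python
import numpy as np


def feature_distance(f_a, f_b, D):
    """Eq. 14: (f_a - f_b)^T D^{-1} (f_a - f_b)."""
    diff = np.asarray(f_a, dtype=float) - np.asarray(f_b, dtype=float)
    return float(diff @ np.linalg.solve(D, diff))


# Hypothetical variances for [F0 in Hz, duration in s]: (5 Hz)^2 and (0.02 s)^2.
D = np.array([[25.0, 0.0], [0.0, 0.0004]])
target = [118.0, 0.10]                       # point on the optimal trajectory
candidates = [[120.0, 0.14], [125.0, 0.10]]
costs = [feature_distance(c, target, D) for c in candidates]
# costs is approximately [4.16, 1.96]
```

Although the first candidate is closer in F0, its duration lies two standard deviations from the target, so the second candidate has the lower cost.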

For ease of illustration, only F0 and duration parameters are used here in the path computation and cost calculations. For a full target or optimal path prediction, additional measurable acoustic parameters such as energy, MFCCs, etc., can be included. Thus, referring to FIG. 12, each feature vector f includes, or consists of, the beginning and ending F0 values and rates of change of F0, as well as the duration of the segment. Continuity between two segments can be enforced for the F0 values and rates of change of F0. In FIG. 12, thin dotted lines represent the M candidate trajectories. The thick gray segments are the mean trajectory segments for each set of M candidates. The thick black line is the estimated optimal trajectory.

An example of a suitable procedure for a particular synthesized utterance “We were away a week” is shown in FIG. 12. Referring to FIG. 12, the first 25 candidate trajectories for F0 are shown as thin dotted lines. The mean trajectories, obtained as a weighted average of the spline coefficients over each ensemble, are shown as thick gray discontinuous line segments. The continuous thick black line represents the estimated optimal combination of the trajectory segments which is continuous at the joins.

In FIG. 13, possible results from this procedure are illustrated in a three-dimensional space-time representation in which the x and y axes represent F0 and the rate of change of F0, while the z axis represents time. Thin dotted lines represent candidate trajectories; thick discontinuous gray lines represent the weighted mean of each set of candidate trajectories; and the continuous thick black line represents the estimated target or optimal trajectory.

A general procedure for creating local models of the mapping from feature bundles to acoustic feature space may be extended to the case of missing or sparse data. If for a given prosodic phonetic unit only a few candidates exist, say 5 or even just one, reliable predictions about acoustic feature vectors cannot be made from so few examples. However, if it is possible to make a reliable prediction of a particular acoustic feature, such as F0 contour or energy, for a “similar” prosodic phonetic unit, or a set of prosodic phonetic units, this information can be used either to optimally select from the sparse number of candidates, or as a method of predicting the desired acoustic features for possible modifications of an existing prosodic phonetic unit. For example, given that there are only 3 examples of a particular vowel in a given segmental and prosodic context, the estimate of a salient acoustic feature, such as F0 or energy, can be based on a likely larger set of examples where similar vowels can be found in the same context.

A strategy of relaxing from a given prosodic phonetic unit to a more general class of prosodic phonetic units can follow a predefined feature hierarchy which can be specified separately. In a preferred hierarchy approach, the intrinsic features of prosodic phonetic units can be organized in a tree, in which the terminals or leaves are specific prosodic phonetic units in a specific linguistic context. Relaxing a feature criterion then means going up in the tree from a given position. The highest point in the linguistic feature hierarchy tree discriminates the most important linguistic feature; in other words, a candidate not matching that specific desired linguistic feature may be unlikely to produce acceptable synthesis. A candidate that fails to match only at the lowest point in the linguistic tree, one level above the terminal or leaf level, may still produce acceptable synthesis, possibly even if it does not match the desired linguistic feature attributes. The hierarchy does not need to be strict; different linguistic features can be at the same level of the hierarchy. In some linguistic contexts and for a limited set of prosodic phonetic unit classes, it may also be possible to substitute one prosodic phonetic unit for another. Effectively, this means that the candidates used for acoustic feature prediction, and the candidates used for prosodic phonetic unit selection, do not necessarily need to be contained in the same space.
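One simplified way to picture this relaxation is sketched below. The sketch assumes a strictly ordered list of features, from most to least important, in place of the tree described above, and drops the least important criteria first until enough candidates match. The feature names, the sample database, and the function name are hypothetical.

```python
def relax_and_select(database, target, hierarchy, min_candidates):
    """Drop the least important feature criteria until enough candidates match.

    hierarchy lists feature names from most to least important.
    Returns (matching units, feature names that were still required).
    """
    for depth in range(len(hierarchy), 0, -1):
        required = hierarchy[:depth]
        matches = [u for u in database
                   if all(u[name] == target[name] for name in required)]
        if len(matches) >= min_candidates:
            return matches, required
    return list(database), []          # every criterion relaxed


# Hypothetical feature hierarchy and database entries.
hierarchy = ["base_unit", "pitch_level", "tonality", "position"]
database = [
    {"base_unit": "N4", "pitch_level": "L1", "tonality": "iL", "position": "initial"},
    {"base_unit": "N4", "pitch_level": "L1", "tonality": "iL", "position": "final"},
    {"base_unit": "N4", "pitch_level": "L1", "tonality": "iH", "position": "final"},
    {"base_unit": "N4", "pitch_level": "L2", "tonality": "iL", "position": "final"},
]
target = {"base_unit": "N4", "pitch_level": "L1", "tonality": "iL", "position": "medial"}
matches, required = relax_and_select(database, target, hierarchy, min_candidates=2)
```

In this example no unit matches all four features, so the position criterion is relaxed and the two units sharing the base unit, pitch level and tonality are returned.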

Each acoustic unit in the database has as properties, in addition to its own linguistic features and measured acoustic unit values, the linguistic features, and measured or modeled acoustic unit values, for the prosodic phonetic unit that is prior to it in continuous speech (prosodic phonetic unit “p”) as well as for the prosodic phonetic unit that is next (prosodic phonetic unit “n”). This allows one to use the weighted change in pitch, duration or energy (or other acoustic measures) for candidates, based on the acoustic data for prosodic phonetic unit “p” (prior) and prosodic phonetic unit “n” (next) in their original continuous speech context, to predict how much a specific candidate should change relative to the unit at the end of the path to which it will be concatenated. If desired, this concept can be extended so that each acoustic unit has as properties, in addition to its own linguistic features and measured acoustic unit values, the linguistic features, and measured or modeled acoustic unit values, for the prosodic phonetic unit that is two or more positions prior to it in continuous recorded speech in the database, as well as for the prosodic phonetic unit that is two or more positions following it in the continuous recorded speech in the database. These relative properties of a unit can also be based on the properties of a preceding and/or following syllable, word, phrase, sentence, or paragraph.

Methods of Modifying Segments of Uttered Speech Data to Create Acoustic Unit Pathways for Synthesizing Continuous Prosodic Speech

In some cases, it can be expected that even after the best candidates have been selected according to their proximity to the optimal path, the acoustic features may still be too discontinuous at the joins. For some of the acoustic features, in particular F0, energy and duration, the optimal trajectory can be used as a control signal. For example, if F0 jumps by more than a small predetermined quantity (e.g. ½ semitone) between two adjacent units, one may compute a reference trajectory for F0 and modify both units accordingly. Or, if the duration of a single unit differs significantly from the target duration, it may be elongated or shortened, using the information from the optimal trajectory. There are many tools for effecting such changes. For example, the Burg lattice filter can be employed for analysis of the speech signal. However, suitable analyses can also be accomplished by other methods, such as linear predictive coding (LPC), if desired. Below, one exemplary set of procedures according to the invention for changing pitch and duration, for voiced speech signals in particular, is outlined.
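The ½ semitone test mentioned above can be illustrated as follows, using the standard definition of a semitone as one twelfth of an octave on a logarithmic frequency scale. The function names and the sample F0 values are assumptions made for this sketch.

```python
import math


def semitone_jump(f0_left_end, f0_right_start):
    """Signed size of the F0 step across a join, in semitones (12 per octave)."""
    return 12.0 * math.log2(f0_right_start / f0_left_end)


def needs_f0_smoothing(f0_left_end, f0_right_start, threshold=0.5):
    """True when the F0 step across the join exceeds the threshold in semitones."""
    return abs(semitone_jump(f0_left_end, f0_right_start)) > threshold


# Hypothetical joins: F0 at the end of the left unit, F0 at the start of the right unit.
small_step = needs_f0_smoothing(200.0, 204.0)   # about 0.34 semitone
large_step = needs_f0_smoothing(200.0, 206.0)   # about 0.51 semitone
```

Under these assumptions only the second join would trigger computation of a reference trajectory for F0.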

The speech signal analysis method illustrated in FIG. 14 is one example of a useful method for separation of voice source and vocal tract properties. Referring to FIG. 14, the speech signal, after being sent through a pre-emphasis filter (1) to boost higher frequencies for better spectral estimation, is subsequently analyzed in a Burg lattice filter (2), which splits the signal into a set of slower varying partial correlation coefficients (“PARCOR” in the drawing figures) and a residual signal that may contain information on the voice source.

For the case of vowels, the partial correlation coefficients have a physical meaning as reflection coefficients, in terms of a one-dimensional acoustic model of the vocal tract as a tube consisting of a number of uniform segments of equal length with varying cross-sections, in which acoustic waves travel back and forth and are reflected at segment boundaries. The partial correlation coefficients or reflection coefficients r_(k) may be converted into logarithmic area ratios A_(k) by means of the transformation shown in equation 15:

$A_{k} = \log \frac{1 + r_{k}}{1 - r_{k}} \qquad (\text{Eq. 15})$
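A brief sketch of the transformation of Eq. 15 and its inverse is given below. It assumes that the logarithm is the natural logarithm, in which case the inverse is r = tanh(A/2); the function names and the sample coefficients are hypothetical.

```python
import math


def reflection_to_lar(r):
    """Eq. 15 for one coefficient; requires -1 < r < 1."""
    return math.log((1.0 + r) / (1.0 - r))


def lar_to_reflection(a):
    """Inverse of Eq. 15 for the natural logarithm: r = tanh(a / 2)."""
    return math.tanh(a / 2.0)


def interpolate_reflection(r_start, r_end, t):
    """Interpolate two coefficient sets in the log area ratio domain, 0 <= t <= 1."""
    result = []
    for r0, r1 in zip(r_start, r_end):
        a = (1.0 - t) * reflection_to_lar(r0) + t * reflection_to_lar(r1)
        result.append(lar_to_reflection(a))
    return result


# Hypothetical reflection coefficients at the two ends of an interpolation.
r_a = [0.9, -0.5, 0.2]
r_b = [0.5, 0.3, -0.1]
r_mid = interpolate_reflection(r_a, r_b, 0.5)
```

Because tanh maps every real number into the open interval from −1 to 1, coefficients obtained by interpolating log area ratios remain valid reflection coefficients.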

This relationship between reflection coefficients and log area ratios (LARs) holds independently of the physical interpretation, and it has been shown that LARs are sometimes better suited for interpolation than the reflection coefficients themselves. This representation can be used for signal modifications. The residual signal can be further analyzed, see FIG. 14, following the dashed arrow. First, phase information about the signal can be obtained by using a low pass filtered version (3) of the signal, from which the fundamental frequency can more easily be obtained. The phase information associated with the low pass filtered signal can be found by applying the Hilbert transform as a finite impulse response filter (4), followed by processing of the analytic signal (5), as described elsewhere herein. The unrolled phase can then be used for two purposes. First, the times where the phase goes through multiples of 2π can be used as pitch markers. Further, since the relation between time and unrolled phase is monotonic, it can be inverted. For inversion, the unrolled phase is first replaced by a smoothed version obtained by piecewise Hermite spline functions, leading to a representation of time formally as a function of phase. Thus, a continuous time function that is associated with or computed from the speech signal, for example F0, energy, MFCCs, log area ratios, and other measurable acoustic wave data, can also be represented as a function of phase.
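The derivation of pitch markers from the unrolled phase can be sketched as follows, assuming NumPy. For brevity the analytic signal is computed here with an FFT-based Hilbert transform rather than the finite impulse response filter mentioned above, and a pure cosine stands in for the low pass filtered signal; the function names and signal parameters are assumptions of this sketch.

```python
import numpy as np


def analytic_signal(x):
    """FFT-based analytic signal: real part x, imaginary part its Hilbert transform."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)


def pitch_markers(x, fs):
    """Times (s) at which the unrolled phase passes a multiple of 2*pi."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    cycles = phase / (2.0 * np.pi)
    markers = []
    for i in range(len(cycles) - 1):
        k = np.floor(cycles[i]) + 1.0               # next whole number of cycles
        if k <= cycles[i + 1]:
            frac = (k - cycles[i]) / (cycles[i + 1] - cycles[i])
            markers.append((i + frac) / fs)         # linear interpolation in time
    return np.array(markers)


# Stand-in for the low pass filtered signal: 100 Hz cosine, 0.1 s at 8 kHz.
fs, f0 = 8000, 100.0
t = np.arange(800) / fs
x = np.cos(2.0 * np.pi * f0 * t + 0.3)
marks = pitch_markers(x, fs)
```

For this stationary test signal the markers are spaced one period (10 ms) apart; for real speech the spacing would follow the varying fundamental frequency.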

The processing provides, for the duration of a prosodic phonetic unit, a set of polynomial coefficients that describe not only the time values of pitch pulses but also values for other spectral parameters (e.g., LARs) as a function of phase. As is further described herein, the jitter signal, together with the polynomial representation, can provide suitable pitch marker time values.

Over the duration of a prosodic phonetic unit, the number of pitch pulses can usually be determined from the data. Thus, using a normalized phase parameter γ in the interval from 0 to 1 as input, the polynomial representing time as a function of phase computes the time points of pitch pulses for every increment 1/n of γ, where n is the number of pitch cycles over the entire duration of a prosodic phonetic unit. Furthermore, other sets of polynomial coefficients deliver the spectral coefficients as a function of phase (which implies being a function of time). Modification of the polynomial coefficients can be a starting point to enable simultaneous modification of F0 and duration, as well as other acoustic parameters.
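To make the role of the normalized phase parameter γ concrete, the following sketch evaluates a hypothetical time-versus-phase polynomial at increments of 1/n. A single quadratic is assumed here in place of the piecewise Hermite splines described above, and the coefficients are invented for illustration.

```python
def pulse_times(coeffs, n_cycles):
    """Evaluate t(gamma) = sum_k coeffs[k] * gamma**k at gamma = 0, 1/n, ..., 1."""
    times = []
    for j in range(n_cycles + 1):
        gamma = j / n_cycles
        times.append(sum(c * gamma ** k for k, c in enumerate(coeffs)))
    return times


# Hypothetical unit of 80 ms: t(gamma) = 0.060*gamma + 0.020*gamma**2 seconds.
# The pitch periods lengthen across the unit, i.e. F0 falls.
coeffs = [0.0, 0.060, 0.020]
original = pulse_times(coeffs, n_cycles=8)

# Scaling every coefficient by 1.25 stretches the unit to 100 ms; with the same
# number of cycles, every pitch period grows by the same factor.
stretched = pulse_times([1.25 * c for c in coeffs], n_cycles=8)

# Keeping the coefficients but using 10 cycles keeps the duration and raises F0.
raised = pulse_times(coeffs, n_cycles=10)
```

The three sequences show how altering either the coefficients or the number of cycles changes duration and F0 while γ continues to run from 0 to 1.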

Representation by piecewise polynomials is not exact, since the mapping from phase to time is obtained by smoothing. However, the sequence of small time differences between the smoothed pitch pulse sequence and the actually measured sequence can be stored as a jitter signal. For changing fundamental frequency and duration during synthesis, the polynomial representation of time as a function of phase can be altered and alternative pitch pulse sequences, associated with the same value of the normalized phase parameter γ, can be computed.

To restore the natural fluctuations of pitch during synthesis, the modified pitch pulse sequence can be altered by adding the jitter, or an interpolated version of the jitter, to the calculated pitch pulse sequence. An example of this method, as shown in FIG. 15 (box a), can be performed with little if any loss of information. In particular, when the pitch pulse sequence is generated from the unmodified smoothed phase-to-time polynomial mapping, adding the jitter restores the original pitch pulse sequence.
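The storage and re-addition of the jitter signal can be illustrated with the sketch below, which assumes NumPy. The measured pulse times are simulated, a single cubic polynomial fitted by least squares stands in for the smoothed spline mapping, and the factor of 1.2 is an arbitrary example of a duration change.

```python
import numpy as np

# Simulated measured pitch pulse times (s) for a unit with 8 pitch cycles.
rng = np.random.default_rng(1)
gamma = np.linspace(0.0, 1.0, 9)                 # normalized phase at each pulse
measured = 0.060 * gamma + 0.020 * gamma**2 + rng.normal(0.0, 1e-4, 9)

# Smoothed phase-to-time mapping (a cubic stands in for the spline).
coeffs = np.polyfit(gamma, measured, deg=3)
smoothed = np.polyval(coeffs, gamma)
jitter = measured - smoothed                     # stored alongside the coefficients

# Unmodified mapping: adding the jitter recovers the measured sequence.
restored = smoothed + jitter

# Modified mapping: lengthen the unit by 20% and add the original jitter back.
modified = 1.2 * smoothed + jitter
```

The restored sequence agrees with the measured one up to floating point rounding, while the modified sequence keeps the small pulse-to-pulse fluctuations of the original recording.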

A pitch synchronous wavelet analysis and re-synthesis (FIG. 15, item b) of the residual signal can be used to facilitate regenerating the residual signal at altered fundamental frequency, using a combination of the inverse wavelet transform with a Laguerre transform, for example as is described and shown in U.S. Patent Application Publication No. 2008/0195391. The Laguerre transform makes it possible to systematically stretch or contract partial signal windows that are centered on a pitch pulse. An implicit assumption underlying the validity of the signal decomposition can be that the residual signal has a flat spectrum and contains little information on the vocal tract resonances, so that a modification of the fundamental frequency does not significantly change the vocal tract resonances. The spectral information about changing vocal tract resonances, which may be mainly associated with articulation, is contained in the log area ratios. Using the original or altered log area ratios, the spectral envelope information can be restored by the inverse lattice filter (see box c in FIG. 15). Alterations and interpolations between different sets of log area ratios also make it possible, within limits, to modify perceived articulation. The resultant output, as shown in FIG. 15, can be a prosodic synthesized speech signal having a human or humanized sound.

The invention includes machines for analysis, modification and/or prediction of acoustic units useful in speech synthesis as well as machines for synthesizing speech from text or other graphic characters, and for recognizing utterances of human speech and rendering the utterances as text or other graphic characters, which machines comprise one or more suitable computerized devices employing or accessing software stored in computer readable media including random access memory, the software being capable of performing or implementing one or more of the methods described herein.

The disclosed invention can be implemented using various general purpose or special purpose computer systems, chips, boards, modules or other suitable systems or devices as are available from many vendors. One exemplary such computer system includes an input device such as a keyboard, mouse or screen for receiving input from a user, a display device such as a screen for displaying information to a user, computer readable storage media, dynamic memory into which program instructions and data may be loaded for processing, and one or more processors for performing suitable data processing operations. The storage media may comprise, for example, one or more drives for a hard disk, a floppy disk, a CD-ROM, a tape or other storage media, or flash or stick PROM or RAM memory or the like, for storing text, data, speech and software or software tools useful for practicing the invention. Alternatively, or in addition, remote storage may be employed for data and/or programs, which may be retrieved as needed, for example across the internet.

The computer system may be a stand-alone personal computer, a workstation, or a networked computer, or may comprise processing distributed across numerous computing systems, or another suitable arrangement as desired. Files and programs useful in implementing the methods of the invention can be located on the computer system performing the processing or at a remote location.

Software useful for implementing or practicing the invention can be written, created or assembled employing commercially available components, or a suitable programming language, for example Microsoft Corporation's C/C++ or the like. Also by way of example, Carnegie Mellon University's LINK GRAMMAR text parser and the Stanford University Part of Speech Tagger can be employed in text parsing, as can other applications for natural language processing that are known or become known to a person of ordinary skill in the art, for example, dialog systems, automated kiosks, automated directory services, and other tools or applications. Such software, adapted, configured or customized to perform any of the processes of the invention, can be implemented on a general purpose computing device or computing machine, or a dedicated or a customized computing device or machine, to provide a special purpose speech synthesis or speech recognition device or machine having any one or more of the particular features and elements described herein.

Various embodiments of the invention can be useful for the generation of appealing, humanized machine speech for a wide range of applications, including audio or spoken books, magazines, newspapers, drama and other entertainment, voicemail systems, electronically enabled appliances, automobiles, computers, robotic assistants, games and the like.

Such embodiments of the invention can express messages and other communications in any one or more of a variety of expressive prosody styles including, but not limited to, reportorial, persuasive, advocacy, human interest, excited, serious, poetic, and others. Other embodiments of the invention can help train speakers to speak with a desired style or can modify the expressiveness of uttered speech, and optionally, transform the speech to have a different prosodic style.

DISCLOSURES INCORPORATED

The entire disclosure of each and every United States patent and patent application, each foreign and international patent publication, of each other publication and of each unpublished patent application that is specifically referenced in this specification is hereby incorporated by reference herein, in its entirety. Should there appear to be conflict between the meaning of a term employed in the description of the invention in this specification and the usage in material incorporated by reference from another document, the meaning as used herein is intended to prevail.

The foregoing detailed description is to be read in light of and in combination with the preceding background and invention summary descriptions wherein partial or complete information regarding the best mode of practicing the invention, or regarding modifications, alternatives or useful embodiments of the invention may also be set forth or suggested, as will be apparent to one skilled in the art. The description of the invention is intended to be understood as including combinations of the various elements of the invention, and of their disclosed or suggested alternatives, including alternatives disclosed, implied or suggested in any one or more of the various methods, products, compositions, systems, apparatus, instruments, aspects, embodiments, examples described in the specification or drawings, if any, and to include any other written or illustrated combination or grouping of elements of the invention or of the possible practice of the invention, except for groups or combinations of elements that will be or become apparent to a person of ordinary skill in the art as being incompatible with or contrary to the purposes of the invention.

Throughout the description, where methods or processes are described as having, including, or comprising specific process steps, it is contemplated that the processes of the invention can also consist essentially of, or consist of, the recited processing steps. It should be understood that the order of steps or order for performing certain actions is immaterial so long as the invention remains operable. Moreover, two or more steps or actions may be conducted simultaneously.

While illustrative embodiments of the invention have been described above, it is, of course, understood that many and various modifications will be apparent to those of ordinary skill in the relevant art, or may become apparent as the art develops, in the light of the foregoing description. Such modifications are contemplated as being within the spirit and scope of the invention or inventions disclosed in this specification.

1-10. (canceled)
11. A method for categorically mapping the relationship of at least one text unit in a sequence of text to at least one corresponding prosodic phonetic unit, to at least one linguistic feature category in the sequence of text, and to at least one speech utterance represented in a synthesized speech signal, the method comprising: (a) identifying, and optionally modifying, acoustic data representing the at least one speech utterance, to provide the synthesized speech signal; (b) identifying, and optionally modifying, the acoustic data representing the at least one utterance to provide the at least one speech utterance with an expressive prosody determined according to prosodic rules; and (c) identifying acoustic unit feature vectors for each of the at least one prosodic phonetic units, each acoustic unit feature vector comprising a bundle of feature values selected according to proximity to a statistical mean of the values of acoustic unit candidates available for matching with the respective prosodic phonetic unit and, optionally, for acoustic continuity with at least one adjacent acoustic feature vector.
12. A method according to claim 11 comprising selecting an acoustic unit for the sequence of acoustic units from the database of candidate acoustic units according to the proximity of the feature values of the selected acoustic unit to a corresponding acoustic feature vector on the acoustic unit pathway and, optionally, for acoustic continuity of the selected acoustic unit with the acoustic feature values of one or more neighboring acoustic units in the sequence of acoustic units.
13. A method according to claim 11 comprising selecting each acoustic unit for the sequence of acoustic units according to a rank ordering of acoustic units available to represent a specific prosodic phonetic unit in the sequence of text, the rank ordering being determined by the differences between the feature values of available acoustic units and the feature values of an acoustic feature vector on the acoustic unit pathway.
14. A method according to claim 11 comprising determining individual linguistic and acoustic weights for each prosodic phonetic unit according to linguistic feature hierarchies related to a prior adjacent prosodic phonetic unit and to a next adjacent prosodic phonetic unit, wherein each candidate acoustic unit can have different target and join weights for each respective end of the candidate acoustic unit.
15. A method according to claim 14 comprising: (d) measuring one or more acoustic parameters, optionally F0, F1, F2, F3, energy, and the like, across a particular acoustic unit corresponding to a particular prosodic phonetic unit to determine time related changes in the one or more acoustic parameters; and, (e) modeling the particular acoustic unit and the relevant acoustic parameter values of the prior adjacent prosodic phonetic unit and the next adjacent prosodic phonetic unit.
16. A method according to claim 15 wherein the modeling comprises applying combinations of fourth-order polynomials and second- and third-order polynomials to represent n-dimensional trajectories of the modeled acoustic units through unconstrained acoustic space; or comprises applying a lower-order polynomial to constrained acoustic space; and optionally diphones and triphones.
 17. A method according to claim 11 comprising using linguistically selected acoustical candidates for calculating acoustic features for synthesizing speech utterances from text by: calculating an absolute and/or relative desired acoustic parameter value, optionally in terms of fundamental frequency and/or a change in fundamental frequency over the duration of the acoustic unit, the duration optionally being represented by a single point, multiple points, Hermite splines or another suitable representation, the desired acoustic parameter value being based on a weighted average of the actual acoustic parameters for a set of candidate acoustic units selected according to their correspondence to a particular linguistic context, wherein the weighting favors acoustic unit candidates more closely corresponding to the particular linguistic context.
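For illustration only, the weighted average recited in claim 17 might be computed as in the following sketch, in which each candidate carries a context-match score that serves directly as its weight. The score scale, the `(f0_hz, context_match)` layout, and the names used are assumptions for this example; the claim does not specify how the weights are derived.

```python
def desired_value(candidates):
    """Weighted average of candidate parameter values.

    candidates: list of (f0_hz, context_match) pairs, where context_match is
                an assumed score in [0, 1]; higher means the candidate's
                linguistic context is closer to the target context.
    """
    total_weight = sum(match for _, match in candidates)
    if total_weight == 0:
        raise ValueError("at least one candidate needs a nonzero weight")
    return sum(f0 * match for f0, match in candidates) / total_weight

# Hypothetical candidate set for one prosodic phonetic unit.
candidates = [(200.0, 1.0), (180.0, 0.5), (240.0, 0.25)]

target_f0 = desired_value(candidates)
plain_mean = sum(f0 for f0, _ in candidates) / len(candidates)
```

Here the weighted result stays close to the best-matching candidate, whereas the unweighted mean is pulled upward by the poorly matching 240 Hz candidate.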
 18. A method for assigning linguistic and acoustic weights to prosodic phonetic units useful for concatenation into synthetic speech or for speech recognition, the method comprising: determining individual linguistic and acoustic weights for each prosodic phonetic unit according to linguistic feature hierarchies related to a prior adjacent prosodic phonetic unit and to a next adjacent prosodic phonetic unit, wherein each candidate acoustic unit can have different target and join weights for each respective end of the candidate acoustic unit; measuring one or more acoustic parameters, optionally F0, F1, F2, F3, energy, and the like, across a particular acoustic unit corresponding to a particular prosodic phonetic unit to determine time related changes in the one or more acoustic parameters; and, modeling the particular acoustic unit and the relevant acoustic parameter values of the prior adjacent prosodic phonetic unit and the next adjacent prosodic phonetic unit.
 19. A method for deriving a path through acoustic space, the acoustic path comprising desired acoustic feature values for each sequential unit of a sequence of acoustic units to be employed in synthesizing speech from text, the method comprising: calculating the acoustic path in absolute and/or relative coordinates, optionally in terms of fundamental frequency and/or a change in fundamental frequency over the duration of the synthesizing of the text, for the sequence of acoustic units, each desired sequential acoustic unit being represented by a representation, optionally a single point, multiple points, Hermite splines or another suitable acoustic unit representation, according to a weighted average of the acoustic parameters of the acoustic unit representation, wherein the weighted average is based on a degree of accuracy with which the acoustic parameters for each such sequentially desired acoustic unit are known, and on a degree of influence ascribed to each sequential acoustic unit according to the context of the acoustic unit in the sequence of desired acoustic units.
 20.-22. (canceled)
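One possible reading of the weighting recited in claim 19, offered as a sketch and not as the claimed method itself, multiplies a per-unit accuracy score by an influence factor that falls off with distance in the sequence, and uses the product as the weight in a running average. The accuracy scale, the `influence` table, the single-point F0 representation, and the function name are all assumptions introduced for this example.

```python
def acoustic_path(estimates, accuracies, influence=(1.0, 0.5)):
    """Derive one path value per unit from accuracy- and influence-weighted data.

    estimates:  per-unit parameter estimates (here, F0 in Hz)
    accuracies: assumed confidence in each estimate, in [0, 1]
    influence:  assumed weight by distance from the unit being computed
                (index 0 is the unit itself, index 1 its immediate neighbors)
    """
    reach = len(influence) - 1
    path = []
    for i in range(len(estimates)):
        weighted_sum = total_weight = 0.0
        lo = max(0, i - reach)
        hi = min(len(estimates), i + reach + 1)
        for j in range(lo, hi):
            weight = accuracies[j] * influence[abs(i - j)]
            weighted_sum += weight * estimates[j]
            total_weight += weight
        # Fall back to the raw estimate if no weighted data is available.
        path.append(weighted_sum / total_weight if total_weight else estimates[i])
    return path

# The middle unit's estimate is treated as unreliable (accuracy 0.0).
estimates = [100.0, 160.0, 120.0]
accuracies = [1.0, 0.0, 1.0]

path = acoustic_path(estimates, accuracies)
```

In this example the unreliable middle value is replaced by an average of its well-known neighbors, while the reliable end values pass through unchanged.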
 23. A method of deriving an acoustic path comprising a sequence of desired acoustic units extending through unconstrained acoustic space, the acoustic path being useful for synthesizing speech from text with a desired style of speech prosody by concatenating the sequence of desired acoustic units, the method comprising: (a) providing a database of acoustic units wherein each acoustic unit is identified according to a prosodic phonetic unit name and at least one additional linguistic feature; and wherein each acoustic unit has been analyzed according to phase-state metrics so that pitch, energy, and spectral wave data can be modified simultaneously at one or more instants in time; (b) mapping each acoustic unit to prosodic phonetic unit categorizations and additional linguistic categorizations enabling the acoustic unit to be specified and/or altered to provide one or more acoustic units for incorporation into expressively synthesized speech according to prosodic rules; (c) calculating weighted absolute and/or relative acoustic values for a set of candidate acoustic units to match each desired acoustic unit, one candidate set per desired acoustic unit, matching being in terms of linguistic features for the corresponding mapped prosodic phonetic unit or a substitute for the corresponding mapped prosodic phonetic unit; (d) calculating an acoustic path through n-dimensional acoustic space to be sequenced as an utterance of synthesized speech, the acoustic path being defined by the weighted average values for each candidate set of acoustic units; (e) selecting and modifying, as needed, a sequence of acoustic units, or sub-units, for the synthesized speech according to the differences between the weighted acoustic values for a candidate acoustic unit, or sub-unit, and the weighted acoustic values of a point on the calculated acoustic path.
 24.-29. (canceled)